From: Kouber Saparev <kouber(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: BF mamba failure
Date: 2025-09-16 11:45:03
Message-ID: CAN4RuQvQ3ATcYvfTR1LzJnUJXpo_F8mgz-+WxoZsyusLLmCwYA@mail.gmail.com
Lists: pgsql-hackers
On Fri, 12 Sep 2025 at 3:37, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> Okay, the bit about the cascading standby is a useful piece of
> information. Do you have some data about the relation reported in the
> error message this is choking on based on its OID? Is this actively
> used in read-only workloads, with the relation looked at in the
> cascading standby?
This objoid=767325170 does not exist, and neither does the one from the
previous shutdown (objoid=4169049057). So I guess it is something
quasi-temporary that was dropped afterwards.
> Is hot_standby_feedback enabled in the cascading
> standby?
Yes, hot_standby_feedback = on.
> With which process has this cascading standby been created?
> Does the workload of the primary involve a high consumption of OIDs
> for relations, say many temporary tables?
>
Yes, we have around 150 entries added and deleted per second in pg_class,
and around 800 in pg_attribute. So something is actively creating and
dropping tables all the time.
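(For reference, that order of magnitude can be checked from the cumulative
tuple counters of the catalogs themselves, e.g. by sampling them twice and
dividing the deltas by the elapsed seconds - the interval below is arbitrary:)

    psql -Atc "SELECT now(), relname, n_tup_ins, n_tup_del
                 FROM pg_stat_sys_tables
                WHERE relname IN ('pg_class', 'pg_attribute')"
    sleep 60
    psql -Atc "SELECT now(), relname, n_tup_ins, n_tup_del
                 FROM pg_stat_sys_tables
                WHERE relname IN ('pg_class', 'pg_attribute')"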
>
> Another thing that may help is the WAL record history. Are you for
> example seeing attempts to drop twice the same pgstats entry in WAL
> records? Perhaps the origin of the problem is in this area. A
> refcount of 2 is relevant, of course.
>
How could we dig into this, i.e. inspect such attempts in the WAL
records?
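My naive guess would be something along these lines against the archived
segments, i.e. filtering the commit/abort records for their dropped-stats
payload and looking for the objoid from the error message, but please
correct me if that is not what you meant (the path and segment names are
placeholders):

    pg_waldump --rmgr=Transaction --path=/path/to/wal/archive STARTSEG ENDSEG \
        | grep 'dropped stats' \
        | grep 767325170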
>
> I have looked a bit around but nothing has popped up here, so as far
> as I know you seem to be the only one impacted by that.
>
> 1d6a03ea4146 and dc5f9054186a are in 17.3, so perhaps something is
> still off with the drop when applied to cascading standbys. A vital
> piece of information may also be with "generation", which would show
> up in the error report thanks to bdda6ba30cbe, and that's included in
> 17.6. A first thing would be to update to 17.6 and see how things
> go for these cascading setups. If it takes a couple of weeks to have
> one report, we have a hunt that may take a few months at least, except
> if somebody is able to find out the race condition here, me or someone
> else.
>
>
Is it enough to upgrade the replicas, or do we need to upgrade the primary as
well?
--
Kouber