Re: BF mamba failure

From: Kouber Saparev <kouber(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: BF mamba failure
Date: 2025-09-16 11:45:03
Message-ID: CAN4RuQvQ3ATcYvfTR1LzJnUJXpo_F8mgz-+WxoZsyusLLmCwYA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 12, 2025 at 3:37 Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> Okay, the bit about the cascading standby is a useful piece of
> information. Do you have some data about the relation this is choking
> on, based on the OID reported in the error message? Is this actively
> used in read-only workloads, with the relation looked at in the
> cascading standby?

This objoid (767325170) does not exist, and neither did the one reported
at the previous shutdown (objoid=4169049057). So I guess it belongs to
something quasi-temporary that has since been dropped.
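
For the record, a check along these lines comes back empty; the database
name is a placeholder:

    # Does the objoid from the error message still correspond to a
    # pg_class entry? In our case this returns no rows.
    psql -d "$DBNAME" -c \
        "SELECT oid, relname FROM pg_class WHERE oid = 767325170;"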

> Is hot_standby_feedback enabled in the cascading
> standby?

Yes, hot_standby_feedback = on.

> With which process has this cascading standby been created?
> Does the workload of the primary involve a high consumption of OIDs
> for relations, say many temporary tables?
>

Yes, we have around 150 entries added and deleted per second in pg_class,
and around 800 in pg_attribute. So something is actively creating and
dropping tables all the time.
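
Roughly speaking, such churn can be estimated by sampling the catalog
tuple counters twice and diffing the deltas; a sketch, again with the
database name as a placeholder:

    # Sample insert/delete counters for pg_class and pg_attribute,
    # wait a minute, then sample again and compare.
    psql -d "$DBNAME" -c \
        "SELECT relname, n_tup_ins, n_tup_del
           FROM pg_stat_sys_tables
          WHERE relname IN ('pg_class', 'pg_attribute');"
    sleep 60
    psql -d "$DBNAME" -c \
        "SELECT relname, n_tup_ins, n_tup_del
           FROM pg_stat_sys_tables
          WHERE relname IN ('pg_class', 'pg_attribute');"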

>
> Another thing that may help is the WAL record history. Are you, for
> example, seeing attempts to drop the same pgstats entry twice in the
> WAL records? Perhaps the origin of the problem is in this area. A
> refcount of 2 is relevant, of course.
>

How could we dig into this, i.e. inspect such attempts in the WAL
records?
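
Would something along these lines be the right direction? A sketch only;
the WAL directory and LSN range below are placeholders:

    # Scan transaction commit/abort records for drops of the pgstats
    # entry with objoid 767325170; as far as I can tell, pg_waldump
    # prints a "dropped stats" list in the commit record description.
    pg_waldump --path=/var/lib/postgresql/17/main/pg_wal \
               --rmgr=Transaction \
               --start=0/01000000 --end=0/0F000000 |
        grep 'dropped stats' | grep 767325170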

>
> I have looked a bit around but nothing has popped up here, so as far
> as I know you seem to be the only one impacted by that.
>
> 1d6a03ea4146 and dc5f9054186a are in 17.3, so perhaps something is
> still off with the drop when applied to cascading standbys. A vital
> piece of information may also be with "generation", which would show
> up in the error report thanks to bdda6ba30cbe, and that's included in
> 17.6. A first step would be to update to 17.6 and see how things
> go for these cascading setups. If it takes a couple of weeks to get
> one report, we have a hunt that may take a few months at least,
> unless somebody (me or someone else) is able to track down the race
> condition here.
>
>
Is it enough to upgrade the replicas, or do we need to upgrade the
primary as well?

--
Kouber
