snapshot recovery conflict despite hot_standby_feedback set to on

From: "Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com>
To: <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>
Subject: snapshot recovery conflict despite hot_standby_feedback set to on
Date: 2022-01-28 16:17:20
Message-ID: 9aae233b-72ec-b1b8-5716-2a092909f89f@amazon.com
Lists: pgsql-bugs

Hi,

TL;DR:

* Symptom: snapshot recovery conflict on standby despite
hot_standby_feedback set to on.
* Cause: The bug is caused by incorrect 32-bit comparison of xmin in
the case of Btree page reuse WAL record. This should be fixed in
version 14 based on commit e5d8a99903 which introduces 64-bit
comparison.
* Mitigation: This bug occurs if there are old deleted Btree pages, so
a mitigation is to rebuild the index to remove the old deleted pages.

_Detailed Explanation:_

We have been able to capture the xids being compared in
GetConflictingVirtualXIDs() when a snapshot recovery conflict occurs,
as well as the associated WAL record whose replay is being blocked. We
got:

* pxmin: 3882856499
* limitXmin: 1557468379
* WAL record replay being blocked: Btree/REUSE_PAGE

"Btree/REUSE_PAGE" means that on the primary a Btree page deleted some
time ago has been reused.
The limitXmin being used in that case is the xid when the page has been
deleted (+1) on the primary: 1557468379

_The first question is then: why has a conflict been recorded?_

Logically, a conflict is recorded for a backend if its xmin is <=
limitXmin.

But wait, we have pxmin (3882856499) > limitXmin (1557468379) so why is
a conflict being recorded??

The function that is doing the comparison is TransactionIdFollows()
(being called in GetConflictingVirtualXIDs()), and the comparison is
done with a (int32) casting:

diff = (int32) (id1 - id2);
return (diff > 0);

As 3882856499 - 1557468379 = 2325388120 is >= 2^31 + 1, the diff is < 0
(due to the cast), so TransactionIdFollows(3882856499, 1557468379)
wrongly reports that 3882856499 does not follow 1557468379, and the
associated backend is added to the list of conflicting backends.
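
The arithmetic can be reproduced outside the server. The following is
only a minimal sketch, not the PostgreSQL source: xid_t, xid_follows(),
xid_distance() and check_reported_values() are hypothetical stand-in
names, and the handling of special (non-normal) xids is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the 32-bit TransactionId type. */
typedef uint32_t xid_t;

/*
 * Modulo-2^32 "id1 is newer than id2" test, following the two lines
 * quoted above. The conversion of an out-of-range value to int32_t is
 * implementation-defined in C; common compilers wrap it (two's complement).
 */
bool xid_follows(xid_t id1, xid_t id2)
{
    int32_t diff = (int32_t) (id1 - id2);

    return diff > 0;
}

/* Unsigned distance between two xids, before the signed cast. */
uint32_t xid_distance(xid_t id1, xid_t id2)
{
    return id1 - id2;
}

/* Checks the values reported in this message. */
void check_reported_values(void)
{
    /* 2325388120 is larger than 2^31 = 2147483648 ... */
    assert(xid_distance(3882856499u, 1557468379u) == 2325388120u);
    /* ... so the cast makes the diff negative and "follows" is false. */
    assert(!xid_follows(3882856499u, 1557468379u));
}
```

Calling check_reported_values() passes silently, which matches the
behaviour described above: the pxmin is numerically larger than the
limitXmin but is not reported as following it.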

*This is the bug.*

_The second question is: does this big difference make sense?_

Yes, it does. The Btree page has been deleted a long time ago and has
been reused a lot of transactions later. Logically that could happen.

_More details about the bug circumstances:_

It turned out that this bug manifests when there is another
replication slot (that is, one not linked to this standby) on the
primary with a relatively old xmin.

Indeed, first, let's recall that a Btree deleted page is being reused
(on the primary) if (see _bt_page_recyclable()):

TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin)

But TransactionIdPrecedes() is also using an (int32) casting comparison,
so it could also return wrong result if the difference is >= 2^31 + 1.

In our case the comparison was done (on the primary) with something like
TransactionIdPrecedes(1.5B, 3.8B), so it wrongly returns that
1.5B does not precede 3.8B (as the difference is >= 2^31 + 1).

As a consequence, that Btree deleted page is wrongly *not reused*: this
is the bug fixed by Peter in PG 14 (e5d8a99903 commit).

So the Btree deleted page is not being reused and as a consequence there
is no "wrong" snapshot recovery conflict on the Standby.

So, in that case, the bug mentioned above is somehow "protecting" us
from the "false" snapshot recovery conflict on the Standby.

Things can change, however, if another replication slot (one that is
not linked to the standby) is also present on the primary and reports a
relatively old xmin. If that xmin is the oldest TransactionXmin across
all running transactions, it acts as the RecentGlobalXmin.

As a matter of fact, if this xmin coming from the other replication slot
is old enough (meaning its difference with the btpo.xact is < 2^31 + 1),
then TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin)
returns the correct result and the Btree deleted page is now reused.

But as the difference between the btpo.xact and the backend xmin on the
standby is >= 2^31 + 1, the false snapshot recovery conflict
mentioned initially is triggered.
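
To make the interaction concrete, here is a small sketch of the two
checks side by side. It is an illustration under assumptions, not the
server code: page_recyclable() and backend_conflicts() are hypothetical
helpers modelled on the conditions described above, and the slot xmin
of 3000000000 is an invented value (only the pxmin and limitXmin come
from our observations; btpo.xact is taken as limitXmin - 1).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t xid_t;     /* hypothetical stand-in for TransactionId */

/* 32-bit modular comparisons, special xids ignored in this sketch. */
bool xid_precedes(xid_t a, xid_t b) { return (int32_t) (a - b) < 0; }
bool xid_follows(xid_t a, xid_t b)  { return (int32_t) (a - b) > 0; }

/* Primary side: a deleted page is reused once btpo.xact precedes the horizon. */
bool page_recyclable(xid_t btpo_xact, xid_t global_xmin)
{
    return xid_precedes(btpo_xact, global_xmin);
}

/* Standby side: a backend conflicts unless its xmin follows limitXmin. */
bool backend_conflicts(xid_t backend_xmin, xid_t limit_xmin)
{
    return !xid_follows(backend_xmin, limit_xmin);
}

void check_scenario(void)
{
    const xid_t btpo_xact    = 1557468378u;  /* limitXmin - 1 */
    const xid_t limit_xmin   = 1557468379u;  /* observed */
    const xid_t standby_xmin = 3882856499u;  /* observed pxmin */
    const xid_t slot_xmin    = 3000000000u;  /* hypothetical old slot xmin */

    /* No old slot: the horizon is too far ahead, the page is not reused. */
    assert(!page_recyclable(btpo_xact, standby_xmin));
    /* Old slot holds the horizon back: distance < 2^31, page is reused. */
    assert(page_recyclable(btpo_xact, slot_xmin));
    /* The REUSE_PAGE replay then hits the wrapped comparison on the standby. */
    assert(backend_conflicts(standby_xmin, limit_xmin));
}
```

The first assertion corresponds to the "protecting" behaviour described
earlier, the last two to the case with the additional replication slot.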

Peter's commit e5d8a99903, added in PG 14, should fix this bug (as
it makes use of FullTransactionId for the Btree deleted page), even if
the original intent was to avoid "leaking" of Btree deleted pages.
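
For comparison, a 64-bit xid (an epoch in the high 32 bits, the 32-bit
xid in the low 32 bits) can be compared as a plain integer, so a large
distance no longer wraps. The sketch below only illustrates that idea
and is not the code from the commit: full_xid_t, make_full_xid() and
full_xid_precedes() are hypothetical names, and placing both xids in
epoch 0 is an assumption, since we did not record the epochs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 64-bit xid: epoch in the high half, 32-bit xid in the low half. */
typedef uint64_t full_xid_t;

full_xid_t make_full_xid(uint32_t epoch, uint32_t xid)
{
    return ((uint64_t) epoch << 32) | xid;
}

/* Plain unsigned comparison: no modular arithmetic, no signed cast. */
bool full_xid_precedes(full_xid_t a, full_xid_t b)
{
    return a < b;
}

void check_full_xid(void)
{
    /* Assumption: both xids belong to the same epoch (0 here). */
    full_xid_t deleted = make_full_xid(0, 1557468378u);
    full_xid_t horizon = make_full_xid(0, 3882856499u);

    /* The distance exceeds 2^31, yet the ordering is still correct. */
    assert(full_xid_precedes(deleted, horizon));
    assert(!full_xid_precedes(horizon, deleted));
}
```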

_Recommendation:_

Given that the bug described here occurs only in very rare
circumstances, we don't think it is worth back-patching Peter's commit
e5d8a99903.

The reason for this bug report is more to describe a scenario in which
it can happen, in case someone is seeing snapshot recovery conflicts
despite hot_standby_feedback being set to on.
