Incorrect snapshots while promoting hot standby node when 2PC is used

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Incorrect snapshots while promoting hot standby node when 2PC is used
Date: 2021-04-22 20:36:03
Message-ID: 20210422203603.fdnh3fu2mmfp2iov@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Michael Paquier (running locally I think), and subsequently Thomas Munro
(noticing [1]), privately reported that they noticed an assertion failure in
GetSnapshotData(). Both reasonably were wondering if that's related to the
snapshot scalability patches.

Michael reported the following assertion failure in 023_pitr_prepared_xact.pl:
> TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2468, PID: 22901)

> The failure was triggered by one of the new TAP tests,
> 023_pitr_prepared_xact.pl, after recovering a 2PC transaction that
> used a transaction ID that matches with RecentXmin:
> (gdb) p RecentXmin
> $1 = 588
> (gdb) p TransactionXmin
> $2 = 589

I tried for a while to reproduce that, but couldn't. Adding a bunch of
debugging output and increasing the log level shows the problem pretty clearly
however, just not tripping any asserts:

2021-04-21 17:55:54.287 PDT [1829098] [unknown] LOG: setting xmin to 588
...
2021-04-21 17:55:54.377 PDT [1829049] DEBUG: removing all KnownAssignedXids
2021-04-21 17:55:54.377 PDT [1829049] DEBUG: release all standby locks
...
2021-04-21 17:55:54.396 PDT [1829100] [unknown] LOG: setting xmin to 589
...
2021-04-21 17:55:55.379 PDT [1829048] LOG: database system is ready to accept connections
...
2021-04-21 17:55:55.380 PDT [1829120] LOG: setting xmin to 588
...
2021-04-21 17:55:55.386 PDT [1829126] [unknown] LOG: setting xmin to 588
2021-04-21 17:55:55.387 PDT [1829126] 023_pitr_prepared_xact.pl LOG: statement: COMMIT PREPARED 'fooinsert';
...
2021-04-21 17:55:55.428 PDT [1829128] [unknown] LOG: setting xmin to 589

So there's clear proof for xmin going from 588 to 589 and back and
forth.

After looking some more the bug isn't even that subtle. And definitely not new
- likely it exists since the introduction of hot standby.

The sequence in StartupXLOG() leading to the issue is the following:

1) RecoverPreparedTransactions();
2) ShutdownRecoveryTransactionEnvironment();
3) XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;

Because 2) resets the KnownAssignedXids machinery, snapshots that happen
between 2) and 3) will not actually look at the procarray to compute
snapshots, as that only happens within

snapshot->takenDuringRecovery = RecoveryInProgress();
if (!snapshot->takenDuringRecovery)

and RecoveryInProgress() checks XLogCtl->SharedRecoveryState !=
RECOVERY_STATE_DONE, which is set in 3).

So snapshots within that window will always be "empty", i.e. xmin is
latestCompletedXid and xmax is latestCompletedXid + 1. Once we reach 3), we'll
look at the procarray, which then leads xmin going back to 588.

I think that this can lead to data corruption, because a too new xmin horizon
could lead to rows from a prepared transaction getting hint bitted as dead (or
perhaps even pruned, although that's probably harder to hit). Due to the too
new xmin horizon we won't treat rows by the prepared xact as in-progress, and
TransactionIdDidCommit() will return false, as the transaction didn't commit
yet. Which afaict can result in row versions created by the prepared
transaction being invisible even after the prepared transaction commits.

Without prepared transaction there probably isn't an issue, because there
shouldn't be any other in-progress xids at that point?

I think to fix the issue we'd have to move
ShutdownRecoveryTransactionEnvironment() to after XLogCtl->SharedRecoveryState
= RECOVERY_STATE_DONE.

The acquisition of ProcArrayLock() in
ShutdownRecoveryTransactionEnvironment()->ExpireAllKnownAssignedTransactionIds()
should prevent the data from being removed between the RecoveryInProgress()
and the KnownAssignedXidsGetAndSetXmin() calls in GetSnapshotData().

I haven't yet figured out whether there would be a problem with deferring the
other tasks in ShutdownRecoveryTransactionEnvironment() until after
RECOVERY_STATE_DONE.

I think we ought to introduce assertions that have a higher chance to
catch cases like this. The window to hit the new assertion that caused
Michael to hit this is pretty darn small (xmin needs to move backwards
between two snapshot computations inside a single transaction). I
*think* we can safely assert that xmin doesn't move backwards globally,
if we store it as a 64bit xid, and don't perform that check in
walsender?

Greetings,

Andres Freund

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-04-20%2003%3A04%3A04

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Honza Horak 2021-04-22 20:39:53 Re: Do we work with LLVM 12 on s390x?
Previous Message Tom Lane 2021-04-22 20:28:36 Re: posgres 12 bug (partitioned table)