Re: Incorrect snapshots while promoting hot standby node when 2PC is used

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Incorrect snapshots while promoting hot standby node when 2PC is used
Date: 2021-05-26 07:57:31
Message-ID: YK3/ayaKnpLAMYoZ@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Apr 22, 2021 at 01:36:03PM -0700, Andres Freund wrote:
> The sequence in StartupXLOG() leading to the issue is the following:
>
> 1) RecoverPreparedTransactions();
> 2) ShutdownRecoveryTransactionEnvironment();
> 3) XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
>
> Because 2) resets the KnownAssignedXids machinery, snapshots that happen
> between 2) and 3) will not actually look at the procarray to compute
> snapshots, as that only happens within
>
> snapshot->takenDuringRecovery = RecoveryInProgress();
> if (!snapshot->takenDuringRecovery)
>
> and RecoveryInProgress() checks XLogCtl->SharedRecoveryState !=
> RECOVERY_STATE_DONE, which is set in 3).

Oh, indeed. It is easy to see RecentXmin jumping back-and-worth while
running 023_pitr_prepared_xact.pl with a small sleep added just after
ShutdownRecoveryTransactionEnvironment().

> Without prepared transaction there probably isn't an issue, because there
> shouldn't be any other in-progress xids at that point?

Yes, there should not be any as far as I recall. 2PC is kind of
special with its fake ProcArray entries.

> I think to fix the issue we'd have to move
> ShutdownRecoveryTransactionEnvironment() to after XLogCtl->SharedRecoveryState
> = RECOVERY_STATE_DONE.
>
> The acquisition of ProcArrayLock() in
> ShutdownRecoveryTransactionEnvironment()->ExpireAllKnownAssignedTransactionIds()
> should prevent the data from being removed between the RecoveryInProgress()
> and the KnownAssignedXidsGetAndSetXmin() calls in GetSnapshotData().
>
> I haven't yet figured out whether there would be a problem with deferring the
> other tasks in ShutdownRecoveryTransactionEnvironment() until after
> RECOVERY_STATE_DONE.

Hmm. This would mean releasing all the exclusive locks tracked by a
standby, as of StandbyReleaseAllLocks(), after opening the instance
for writes after a promotion. I don't think that's unsafe, but it
would be intrusive.

Anyway, isn't the issue ExpireAllKnownAssignedTransactionIds() itself,
where we should try to not wipe out the 2PC entries to make sure that
all those snapshots still see the 2PC transactions as something to
count on? I am attaching a crude patch to show the idea.

Thoughts?
--
Michael

Attachment Content-Type Size
2pc-recovery-snap-v1.patch text/x-diff 1.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2021-05-26 08:00:22 Re: [BUG] Autovacuum not dynamically decreasing cost_limit and cost_delay
Previous Message houzj.fnst@fujitsu.com 2021-05-26 07:37:15 Fix typo: multiple tuple => tuples