Re: [BUG] recovery of prepared transactions during promotion can fail

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Julian Markwort <julian(dot)markwort(at)cybertec(dot)at>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [BUG] recovery of prepared transactions during promotion can fail
Date: 2023-06-19 05:24:44
Message-ID: ZI/mnFy2QG4ULkPF@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jun 16, 2023 at 04:27:40PM +0200, Julian Markwort wrote:
> I've discovered a serious bug that leads to a server crash upon
> promoting an instance that crashed previously and did recovery in
> standby mode.

Reproduced here, for the versions mentioned.

> The bug is present in PostgreSQL versions 13 and 14 (and in earlier
> versions, though it doesn't manifest itself so catastrophically).
> The circumstances to trigger the bug are as follows:
> - postgresql is configured for hot_standby, archiving, and prepared transactions
> - prepare a transaction
> - crash postgresql
> - create standby.signal file
> - start postgresql, wait for recovery to finish
> - promote

hot_standby allows one to run queries on a standby running recovery,
so it seems to me that it does not really matter. Enabling archiving
is the critical piece. The nodes set in the TAP test 009_twophase.pl
don't use any kind of archiving. But once it is enabled on London the
first promotion command of the test fails the same way as you report.
# Setup london node
my $node_london = get_new_node("london");
-$node_london->init(allows_streaming => 1);
+# Archiving is used to force tests with .partial segment creations
+# done at the end of recovery.
+$node_london->init(allows_streaming => 1, has_archiving => 1);

Enabling the archiving does not impact any of the tests, as we don't
use restore_command during recovery and only rely on streaming.

> The cause of this failure is an oversight (rather obvious in
> hindsight): The renaming of the WAL file (that was last written to
> before the crash happened) to .partial is done *before* PostgreSQL
> might have to read this very file to recover prepared transactions
> from it. The relevant function calls here are durable_rename() and
> RecoverPreparedTransactions() in xlog.c.
>
> Note that it is important that the PREPARE entry is in the WAL file
> that PostgreSQL is writing to prior to the inital crash.
> This has happened repeatedly in production already with a customer
> that uses prepared transactions quite frequently. I assume that
> this has happened for others too, but the circumstances of the crash
> and the cause are very dubious, and troubleshooting it is pretty
> difficult.

I guess that this is a possibility yes. I have not heard directly
about such a report, but perhaps that's just because few people use
2PC.

> This behaviour has - apparently unintentionally - been fixed in PG
> 15 and upwards (see commit 811051c ), as part of a general
> restructure and reorganization of this portion of xlog.c (see commit
> 6df1543 ).
>
> Furthermore, it seems this behaviour does not appear in PG 12 and
> older, due to another possible bug: In PG 13 and newer, the
> XLogReaderState is reset in XLogBeginRead() before reading WAL in
> XlogReadTwoPhaseData() in twophase.c .
> In the older releases (PG <= 12), this reset is not done, so the
> requested LSN containing the prepared transaction can (by happy
> coincidence) be read from in-memory buffers, and PostgreSQL
> consequently manages to come up just fine (as the WAL has already
> been read into buffers prior to the .partial rename). If the older
> releases also where to properly reset the XLogReaderState, they
> would also fail to find the LSN on disk, and hence PostgreSQL would
> crash again.

That's debatable, but I think that I would let v12 and v11 be as they
are. v11 is going to be end-of-life soon and we did not have any
complains on this matter as far as I know, so there is a risk of
breaking something upon its last release. (Got some, Err..
experiences with that in the past). On REL_11_STABLE, note for
example the slight difference with the handling of
recovery_end_command, where we rely on InRecovery rather than
ArchiveRecoveryRequested. REL_12_STABLE is in a more consistent shape
than v11 regarding that.

> I've attached patches for PG 14 and PG 13 that mimic the change in
> PG15 (commit 811051c ) and reorder the crucial events, placing the
> recovery of prepared transactions *before* renaming the file.

Yes, I think that's OK. I would like to add two things to your
proposal for all the existing branches.
- Addition of a comment where RecoverPreparedTransactions() is called
at the end of recovery to tell that we'd better do that before working
on the last partial segment of the old timeline.
- Enforce the use of archiving in 009_twophase.pl.

> My humble opinion is that this fix should be backpatched to PG 14
> and PG 13. It's debatable whether the fix needs to be brought back
> to 12 and older also, as those do not exhibit this issue, but the
> order of renaming is still wrong.

Yeah, I'd rather wait for somebody to complain about that. And v11 is
not worth taking risks with at this time of the year, IMHO.

With your fix included, the patch for REL_14_STABLE would be like the
attached. Is that OK for you?
--
Michael

Attachment Content-Type Size
0001-Fix-promotion-with-prepared-transactions-and-the-las.patch text/x-diff 4.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2023-06-19 05:25:14 Re: [BUG] recovery of prepared transactions during promotion can fail
Previous Message shveta malik 2023-06-19 03:39:27 Re: Support logical replication of DDLs