[BUG] recovery of prepared transactions during promotion can fail

From: Julian Markwort <julian(dot)markwort(at)cybertec(dot)at>
To: pgsql-hackers(at)postgresql(dot)org
Subject: [BUG] recovery of prepared transactions during promotion can fail
Date: 2023-06-16 14:27:40
Message-ID: 743b9b45a2d4013bd90b6a5cba8d6faeb717ee34.camel@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hey everyone,

I've discovered a serious bug that leads to a server crash upon promoting an instance that crashed previously and did
recovery in standby mode.

The bug is present in PostgreSQL versions 13 and 14 (and in earlier versions, though it doesn't manifest itself so
catastrophically).
The circumstances to trigger the bug are as follows:
- postgresql is configured for hot_standby, archiving, and prepared transactions
- prepare a transaction
- crash postgresql
- create standby.signal file
- start postgresql, wait for recovery to finish
- promote

The promotion will fail with a FATAL error, stating that "requested WAL segment .* has already been removed".
The FATAL error causes the startup process to exit, so postmaster shuts down again.

Here's an exemplary log output, maybe this helps people to find this issue when they search for it online:

LOG: consistent recovery state reached at 0/15D8AB0
LOG: database system is ready to accept read only connections
LOG: received promote request
LOG: redo done at 0/15D89B8
LOG: last completed transaction was at log time 2023-06-16 13:09:53.71118+02
LOG: selected new timeline ID: 2
LOG: archive recovery complete
FATAL: requested WAL segment pg_wal/000000010000000000000001 has already been removed
LOG: startup process (PID 1650358) exited with exit code 1
LOG: terminating any other active server processes
LOG: database system is shut down

The cause of this failure is an oversight (rather obvious in hindsight):
The renaming of the WAL file (that was last written to before the crash happened) to .partial is done *before* PostgreSQL
might have to read this very file to recover prepared transactions from it.
The relevant function calls here are durable_rename() and RecoverPreparedTransactions() in xlog.c .

Note that it is important that the PREPARE entry is in the WAL file that PostgreSQL is writing to prior to the inital
crash.
This has happened repeatedly in production already with a customer that uses prepared transactions quite frequently.
I assume that this has happened for others too, but the circumstances of the crash and the cause are very dubious, and
troubleshooting it is pretty difficult.

This behaviour has - apparently unintentionally - been fixed in PG 15 and upwards (see commit 811051c ), as part of a
general restructure and reorganization of this portion of xlog.c (see commit 6df1543 ).

Furthermore, it seems this behaviour does not appear in PG 12 and older, due to another possible bug:
In PG 13 and newer, the XLogReaderState is reset in XLogBeginRead() before reading WAL in XlogReadTwoPhaseData() in
twophase.c .
In the older releases (PG <= 12), this reset is not done, so the requested LSN containing the prepared transaction can
(by happy coincidence) be read from in-memory buffers, and PostgreSQL consequently manages to come up just fine (as the
WAL has already been read into buffers prior to the .partial rename).
If the older releases also where to properly reset the XLogReaderState, they would also fail to find the LSN on disk, and
hence PostgreSQL would crash again.

I've attached patches for PG 14 and PG 13 that mimic the change in PG15 (commit 811051c ) and reorder the crucial events,
placing the recovery of prepared transactions *before* renaming the file.
I've also attached recovery test scripts for PG >= 12 and PG <= 11 that can be used to verify that promote after recovery
with prepared transactions works.

A note for myself in the future and whomever may find it useful:
The test can be copied to src/test/recovery/t/ and selectively run (after you've ./configure'd for TAP testing and
compiled everything) from within the src/test/recovery directory using something like:
make check PROVE_TESTS='t/PG_geq_12_promote_prepare_xact.pl'

My humble opinion is that this fix should be backpatched to PG 14 and PG 13.
It's debatable whether the fix needs to be brought back to 12 and older also, as those do not exhibit this issue, but the
order of renaming is still wrong.
I'm not sure if there could be cases where the in-memory buffers of the walreader are too small to cover a whole WAL
file.
There could also be other issues from operations that require reading WAL that happen after the .partial rename, I
haven't checked in depth what else happens in the affected codepath.
Please let me know if you think this should also be fixed in PG 12 and earlier, so I can produce the patches for those
versions as well.

Kind regards
Julian

Attachment Content-Type Size
PG_13_fix_prepared_xact_recovery.patch text/x-patch 3.2 KB
PG_14_fix_prepared_xact_recovery.patch text/x-patch 3.1 KB
PG_geq_12_promote_prepare_xact.pl application/x-perl 1.0 KB
PG_leq_11_promote_prepare_xact.pl application/x-perl 1.1 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Aleksander Alekseev 2023-06-16 14:44:37 Re: Pluggable toaster
Previous Message Melih Mutlu 2023-06-16 14:03:14 Parent/child context relation in pg_get_backend_memory_contexts()