Re: [BUG] recovery of prepared transactions during promotion can fail

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: julian(dot)markwort(at)cybertec(dot)at
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [BUG] recovery of prepared transactions during promotion can fail
Date: 2023-06-19 05:25:14
Message-ID: 20230619.142514.373396981484508204.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for the report, reproducer and the patches.

At Fri, 16 Jun 2023 16:27:40 +0200, Julian Markwort <julian(dot)markwort(at)cybertec(dot)at> wrote in
> - prepare a transaction
> - crash postgresql
> - create standby.signal file
> - start postgresql, wait for recovery to finish
> - promote
..
> The promotion will fail with a FATAL error, stating that "requested WAL segment .* has already been removed".
> The FATAL error causes the startup process to exit, so postmaster shuts down again.
>
> Here's an exemplary log output, maybe this helps people to find this issue when they search for it online:

> LOG: redo done at 0/15D89B8
> LOG: last completed transaction was at log time 2023-06-16 13:09:53.71118+02
> LOG: selected new timeline ID: 2
> LOG: archive recovery complete
> FATAL: requested WAL segment pg_wal/000000010000000000000001 has already been removed
> LOG: startup process (PID 1650358) exited with exit code 1

Reproduced here.

> The cause of this failure is an oversight (rather obvious in hindsight):
> The renaming of the WAL file (that was last written to before the crash happened) to .partial is done *before* PostgreSQL
> might have to read this very file to recover prepared transactions from it.
> The relevant function calls here are durable_rename() and RecoverPreparedTransactions() in xlog.c .
> This behaviour has - apparently unintentionally - been fixed in PG 15 and upwards (see commit 811051c ), as part of a
> general restructure and reorganization of this portion of xlog.c (see commit 6df1543 ).

I think so, the reordering might have done for some other reasons, though.

> Furthermore, it seems this behaviour does not appear in PG 12 and older, due to another possible bug:

<snip>...

> In PG 13 and newer, the XLogReaderState is reset in XLogBeginRead()
> before reading WAL in XlogReadTwoPhaseData() in twophase.c .

I arraived at the same conclusion.

> In the older releases (PG <= 12), this reset is not done, so the requested LSN containing the prepared transaction can
> (by happy coincidence) be read from in-memory buffers, and PostgreSQL consequently manages to come up just fine (as the
> WAL has already been read into buffers prior to the .partial rename).
> If the older releases also where to properly reset the XLogReaderState, they would also fail to find the LSN on disk, and
> hence PostgreSQL would crash again.

From the perspective of loading WAL for prepared transactions, the
current code in those versions seems fine. Although I suspect Windows
may not like to rename currently-open segments, it's likely acceptable
as the current test set operates without issue.. (I didn't tested this.)

> I've attached patches for PG 14 and PG 13 that mimic the change in PG15 (commit 811051c ) and reorder the crucial events,
> placing the recovery of prepared transactions *before* renaming the file.

It appears to move the correct part of the code to the proper
location, modifying the steps to align with later versions.

> I've also attached recovery test scripts for PG >= 12 and PG <= 11 that can be used to verify that promote after recovery
> with prepared transactions works.

It effectively detects the bug, though it can't be directly used in
the tree as-is. I'm unsure whether we need this in the tree, though.

> My humble opinion is that this fix should be backpatched to PG 14 and PG 13.

I agree with you.

> It's debatable whether the fix needs to be brought back to 12 and older also, as those do not exhibit this issue, but the
> order of renaming is still wrong.
> I'm not sure if there could be cases where the in-memory buffers of the walreader are too small to cover a whole WAL
> file.
> There could also be other issues from operations that require reading WAL that happen after the .partial rename, I
> haven't checked in depth what else happens in the affected codepath.
> Please let me know if you think this should also be fixed in PG 12 and earlier, so I can produce the patches for those
> versions as well.

There's no immediate need to change the versions. However, I would
prefer to backpatch them to the older versions for the following
reasons.

1. Applying this eases future backpatching in this area, if any.

2. I have reservations about renaming possibly-open WAL segments.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2023-06-19 05:37:48 Re: Allow pg_archivecleanup to remove backup history files
Previous Message Michael Paquier 2023-06-19 05:24:44 Re: [BUG] recovery of prepared transactions during promotion can fail