Re: Fix crash during recovery when redo segment is missing

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Kirill Reshke <reshkekirill(at)gmail(dot)com>
Cc: Nitin Jadhav <nitinjadhavpostgres(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Fix crash during recovery when redo segment is missing
Date: 2025-12-04 08:48:59
Message-ID: aTFK-5qXT4gGbiei@paquier.xyz
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Dec 04, 2025 at 12:00:23PM +0500, Kirill Reshke wrote:
> Your tests are using a busy loop with usleep which nowadays is
> considered as bad practice. There 3 of such places, and I think the
> first two of them
> can be replaced with injection point wait.

(This fell off my radar, apologies about that.)

The problem with this test is not related to its use of sleeps, which
is perfectly fine to check the start or end timings of a checkpoint
depending on the contents of the logs. It has two issues.

One first issue is that the test is unnecessary long, taking more than
30s to finish because it relies on checkpoint_timeout to kick a
checkpoint. This could use a background psql object to kick a
checkpoint to accelerate the whole. So the test is sitting idle for a
long time, doing nothing.

Your intuition about injection points is right, though, but it points
to a second problem: the test is not reliable because we could finish
the checkpoint *before* the WAL segment is switched, and we expect the
WAL segment switch to happen while the checkpoint is processing. If
you want to make that deterministic, having a wait in the middle of
checkpointing would make the test actually test what it should. In
this case the test would randomly die on its "redo record wal file is
the same" message. That's OK to prove the point of the initial patch,
but it's not acceptable for a test that could be added to the tree.

An update of src/test/recovery/meson.build to add the new test is
also required.

Anyway, I think that the patch is overdoing it in issuing a PANIC in
this case, going backwards with the other thread from [0]: a FATAL
would be perfectly OK, like in the backup_label path because your
manipulations of WAL segment missing are indeed possible, as the test
posted is proving. And there have been many arguments in the past
about performing recovery without a backup_label, as well. And that
would make the test something that we could use, because no backtrace
on buildfarm hosts or disk bloat.

If we do not re-check in InitWalRecovery() that the redo record is
around, the startup process happily goes down to PerformWalRecovery()
in an everything-is-fine mode, initializing a bunch of states, to then
crash due to a pointer dereference while attempting to read the record
from the redo LSN, which does not exist. This is not right. So,
oops. I would see an argument good enough for a backpatch when it
comes to crash recovery, because we actually don't know what has
happened in this case. Or not based on the lack of complaints on the
matter over the years.

So I'd argue about updating the patch among these lines, instead:
- Switch the PANIC to a FATAL, to inform about the pilot error. This
addresses the pointer dereference with WAL replay going crazy for the
redo LSN.
- Add a test, with an injection point after checkpoint startup, based
on a manual checkpoint done in a backgroud psql, without a
checkpoint_timeout to make the test faster.
- We could be careful first based on the lack of complaints, doing
that extra check only on HEAD, for v19.

> 4) There are comments from Robert, which are not covered [1].
>
> [0] https://www.postgresql.org/message-id/20231023232145.cmqe73stvivsmlhs%40awork3.anarazel.de
> [1] https://www.postgresql.org/message-id/CA%2BTgmob1x7HMcWAb%3D6ep2cBuWuwpT-p9E7EmQegWFu6E8nKHeg%40mail.gmail.com

I get the argument about the spaghetti code in general, however the
code in question here deals with the WAL replay initialization, for a
backup_label vs a non-backup_label path. Perhaps I'm more used to
this area than others, but it's not that much pasta to me.

I don't see a point in moving the memcpy() and the wasShutdown parts
as proposed in the patch, by the way. The PANIC would block things
in its else block. Let's limit outselves to the extra ReadRecord()
check for the redo record when we find a checkpoint record.

I'd actually wonder if we should not lower the existing PANIC to a
FATAL if ReadCheckpointRecord() fails, as well. The result would be
the same for operators, without the need to deal with backtrace
cleanups after a crash. And we still have the error message that
tells us what's going on.

Any thoughts or opinions from others?
--
Michael

In response to

Browse pgsql-hackers by date

  From Date Subject
Previous Message Chao Li 2025-12-04 08:40:18 Re: tuple radix sort