Re: Fix crash during recovery when redo segment is missing

From: Kirill Reshke <reshkekirill(at)gmail(dot)com>
To: Nitin Jadhav <nitinjadhavpostgres(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Fix crash during recovery when redo segment is missing
Date: 2025-12-04 07:00:23
Message-ID: CALdSSPieG0wqQnjxyF+U23QQfZfpGnUFvKsJR0pxAbUoY=Z0vA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 4 Dec 2025 at 11:37, Nitin Jadhav <nitinjadhavpostgres(at)gmail(dot)com> wrote:
>
> Apologies, I missed attaching the patch earlier. Please find the v2
> version attached.
>
> Best Regards,
> Nitin Jadhav
> Azure Database for PostgreSQL
> Microsoft
>
> On Thu, Dec 4, 2025 at 12:01 PM Nitin Jadhav
> <nitinjadhavpostgres(at)gmail(dot)com> wrote:
> >
> > The patch wasn’t applying cleanly on master, so I’ve rebased it and
> > also added it to the PG19‑4 CommitFest:
> > https://commitfest.postgresql.org/patch/6279/
> > Please review and share your feedback.
> >
> > Best Regards,
> > Nitin Jadhav
> > Azure Database for PostgreSQL
> > Microsoft
> >
> > Best Regards,
> > Nitin Jadhav
> > Azure Database for PostgreSQL
> > Microsoft
> >
> >
> > On Fri, Feb 21, 2025 at 4:29 PM Nitin Jadhav
> > <nitinjadhavpostgres(at)gmail(dot)com> wrote:
> > >
> > > Hi,
> > >
> > > In [1], Andres reported a bug where PostgreSQL crashes during recovery
> > > if the segment containing the redo pointer does not exist. I have
> > > attempted to address this issue and I am sharing a patch for the same.
> > >
> > > The problem was that PostgreSQL did not PANIC when the redo LSN and
> > > checkpoint LSN were in separate segments, and the file containing the
> > > redo LSN was missing, leading to a crash. Andres has provided a
> > > detailed analysis of the behavior across different settings and
> > > versions. Please refer to [1] for more information. This issue arises
> > > because PostgreSQL does not PANIC initially.
> > >
> > > The issue was resolved by ensuring that the REDO location exists once
> > > we successfully read the checkpoint record in InitWalRecovery(). This
> > > prevents control from reaching PerformWalRecovery() unless the WAL
> > > file containing the redo record exists. A new test script,
> > > 044_redo_segment_missing.pl, has been added to validate this. To
> > > populate the WAL file with a redo record different from the WAL file
> > > with the checkpoint record, I wait for the checkpoint start message
> > > and then issue a pg_switch_wal(), which should occur before the
> > > completion of the checkpoint. Then, I crash the server, and during the
> > > restart, it should log an appropriate error indicating that it could
> > > not find the redo location. Please let me know if there is a better
> > > way to reproduce this behavior. I have tested and verified this with
> > > the various scenarios Andres pointed out in [1]. Please note that this
> > > patch does not address error checking in StartupXLOG(),
> > > CreateCheckPoint(), etc., nor does it focus on cleaning up existing
> > > code.
> > >
> > > Attaching the patch. Please review and share your feedback. Thanks to
> > > Andres for spotting the bug and providing the detailed report [1].
> > >
> > > [1]: https://www.postgresql.org/message-id/20231023232145.cmqe73stvivsmlhs%40awork3.anarazel.de
> > >
> > > Best Regards,
> > > Nitin Jadhav
> > > Azure Database for PostgreSQL
> > > Microsoft

Hi!

1) Please do not top-post on these lists

2) I did not get the exact reason this thread is different from [0]

3)

Your tests are using a busy loop with usleep which nowadays is
considered as bad practice. There 3 of such places, and I think the
first two of them
can be replaced with injection point wait.

> +# Wait for the checkpoint to complete
> +my $checkpoint_complete = 0;
> +foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default) {
> + if ($node->log_contains("checkpoint complete", $logstart)) {
> + $checkpoint_complete = 1;
> + last;
> + }
> + usleep(100_000);
> +}

4) There are comments from Robert, which are not covered [1].

[0] https://www.postgresql.org/message-id/20231023232145.cmqe73stvivsmlhs%40awork3.anarazel.de
[1] https://www.postgresql.org/message-id/CA%2BTgmob1x7HMcWAb%3D6ep2cBuWuwpT-p9E7EmQegWFu6E8nKHeg%40mail.gmail.com

--
Best regards,
Kirill Reshke

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Brandon Tat 2025-12-04 07:01:31 Re: Improve error reporting in 027_stream_regress test
Previous Message jian he 2025-12-04 06:51:54 Re: alter check constraint enforceability