From: | BharatDB <bharatdbpg(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Srinath Reddy Sadipiralla <srinath2133(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [PATCH] Fix pg_rewind false positives caused by shutdown-only WAL |
Date: | 2025-09-30 05:30:48 |
Message-ID: | CAAh00EQjzwShaJkmWaLmQ4GmMPJEfmsNeLj3XStqyjeeDWoCbw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Dear Robert.
With regard to the mail, I firstly apologise for the confusion created
regarding authorship. The patch was developed as part of my work in my
official email id and the patch was committed under my personal GitHub
account during collaborative development. Going forward, I will make sure
the email address, signature, and patch author information are consistent
to avoid any ambiguity in future. On reviewing your objections posted on
September 25, I understood the following points :
-
1. Control file changes:
You are correct that the control file always changes on shutdown, and that
pg_rewind cannot simply ignore those updates. My earlier patch proposal did
not address that, and I now understand why the server itself would reject a
mismatch here.
-
2. Other WAL records (RUNNING_XACTS):
I see now that a clean shutdown generates both RUNNING_XACTS and
CHECKPOINT_SHUTDOWN. My patch only skipped over the latter, so in practice
rewind would still be triggered incorrectly. I will extend the logic to
also consider this sequence properly.
-
3. Server-side consistency:
As noted, even if pg_rewind skips shutdown-only WAL records, the restarted
old primary can still fail due to control file divergence (infinite loop
issue). That means it needs a more holistic fix that considers both
pg_rewind and server startup behavior.
-
4. RMID verification:
I did not guard the filtering with an XLogRecGetRmid() check. I’ll fix this
to avoid misclassification, following the walsummarizer.c example as you
suggested.
-
Plan forward:
-
Revise the patch so that pg_rewind correctly checks RMIDs and handles
both RUNNING_XACTS + CHECKPOINT_SHUTDOWN sequences, not just shutdown
checkpoints.
-
Investigate whether control file normalization is required (or whether
server-side startup logic also needs adjustments) so that an old primary
can rejoin cleanly without looping.
-
Ensure consistent patch authorship (my name + email will match the
commit and submission).
-
Add regression coverage under src/bin/pg_rewind/t/ to reproduce this
clean-shutdown failover scenario automatically.
- I’ll prepare and post a new version of the patch with these corrections.
Looking forward for more suggestions from you.
- Thank you for carefully reviewing and pointing out both the technical and
process issues.
-
Best regards,
Soumya
On Mon, Sep 29, 2025 at 7:47 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Sep 29, 2025 at 6:30 AM BharatDB <bharatdbpg(at)gmail(dot)com> wrote:
> > Dear Srinath,
>
> This is a really weird email. First, it doesn't address my objections
> from September 25th. Second, the email is from BharatDB
> <bharatdbpg(at)gmail(dot)com>, the email signature says it's from "Soumya",
> and the patch within is from manimurali1993
> <manimurali1993(at)gamil(dot)com>.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>
From | Date | Subject | |
---|---|---|---|
Next Message | shveta malik | 2025-09-30 05:57:59 | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart |
Previous Message | Peter Eisentraut | 2025-09-30 05:24:49 | Re: Fixing MSVC's inability to detect elog(ERROR) does not return |