Re: [BUG] Take a long time to reach consistent after pg_rewind

From: surya poondla <suryapoondla4(at)gmail(dot)com>
To: cca5507 <cca5507(at)qq(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [BUG] Take a long time to reach consistent after pg_rewind
Date: 2026-06-29 18:53:27
Message-ID: CAOVWO5pj-BOAtSCkuGLCA3HLdFJJ_3hawZ9JLvF2BckRj+15rQ@mail.gmail.com
Lists: pgsql-hackers

Hi ChangAo,

Thanks for the v3. The commit message, inline comment, and the
rewind_source.h note all look good.

On the test front: I don't think a hang-detection test can be made
reliable. The bug requires the source's insert LSN to be exactly
segment_boundary + SizeOfXLogLongPHD with no further WAL activity, but
bgwriter's periodic LogStandbySnapshot emits a RUNNING_XACTS record, which
can advance the insert LSN nondeterministically between pg_switch_wal()
and the rewind. In my reproduction bgwriter ended the hang after ~9s;
that's the kind of timing we don't want in CI.

The deterministic alternative is to parse pg_controldata on the target
after pg_rewind and assert that minRecoveryPoint does not land at
"boundary + SizeOfXLogLongPHD". That's a direct check on the patched
behavior, independent of source idleness or replay timing. However, it
doesn't exercise the integration property that the rewound node reaches
consistency without further upstream WAL, so I am not sure this test
case is complete for our scenario.
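
A minimal sketch of that pg_controldata check, written in Python purely for
illustration (an in-tree test would presumably be a Perl TAP test). It
assumes SizeOfXLogLongPHD is 40 bytes, as on typical 64-bit builds, and that
the pg_controldata output uses the "Minimum recovery ending location" and
"Bytes per WAL segment" labels. The helper names are hypothetical.

```python
import re

# Assumption: SizeOfXLogLongPHD is 40 bytes on typical 64-bit builds.
SIZE_OF_XLOG_LONG_PHD = 40


def parse_lsn(text):
    """Convert an LSN printed as 'hi/lo' (hex) into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def controldata_field(output, label):
    """Return the value printed after 'label:' in pg_controldata output."""
    match = re.search(
        r"^" + re.escape(label) + r":\s*(.+)$", output, re.MULTILINE
    )
    if match is None:
        raise KeyError(label)
    return match.group(1).strip()


def min_recovery_point_at_long_header(output):
    """True if minRecoveryPoint sits at segment boundary + long page header."""
    lsn = parse_lsn(
        controldata_field(output, "Minimum recovery ending location")
    )
    seg_size = int(controldata_field(output, "Bytes per WAL segment"))
    return lsn % seg_size == SIZE_OF_XLOG_LONG_PHD
```

The test would feed the target's pg_controldata output to
min_recovery_point_at_long_header() after pg_rewind and assert that it
returns False.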

Regards,
Surya Poondla
