| From: | cca5507 <cca5507(at)qq(dot)com> |
|---|---|
| To: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
| Cc: | suryapoondla4 <suryapoondla4(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: [BUG] Take a long time to reach consistent after pg_rewind |
| Date: | 2026-06-11 07:50:19 |
| Message-ID: | tencent_62676AC7522E64ACF1360FAAEC1842250B09@qq.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi,
> As I understand it, the root of the problem is that
> get_current_wal_insert_lsn() returns the next WAL insertion position,
> while minRecoveryPoint is expected to point just past the end of the
> last required record. At positions immediately following a WAL page or
> segment header, those locations can be logically equivalent but
> numerically different.
Yes.
> I think pg_rewind is probably using the insert LSN because it wants to
> choose a conservative position as far ahead as possible. It might be
> possible to use the flush LSN if the copying logic is carefully
> arranged, but I would prefer to keep using the insert LSN if we can.
The insert LSN is not crash safe, is this really make sense to use it? For
example, the primary has insert LSN 1000, flush LSN 500, the standby
sets minRecoveryPoint to 1000, and then the primary crash and restart.
The primary now only has LSN 500, but the standby cannot reach
consistent until LSN 1000. This doesn’t make sense to me.
> I also think it is unfortunate to require client-side tools to be
> aware of the exact "end of last record + 1" representation. If both
> representations are logically equivalent at those positions, I would
> rather have the recovery side accept and normalize them than require
> every tool to produce exactly the expected form.
Sounds reasonable.
> So I wonder whether we should instead normalize minRecoveryPoint when
> reading it from the control file, so that these equivalent
> representations are treated as the same position. That would also make
> the recovery side a bit more robust.
>
> For example, something like this:
>
> diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
> index 73b78a83fa7..660c681795d 100644
> --- a/src/backend/access/transam/xlogrecovery.c
> +++ b/src/backend/access/transam/xlogrecovery.c
> @@ -955,6 +955,20 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
> {
> minRecoveryPoint = ControlFile->minRecoveryPoint;
> minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
> +
> + /*
> + * minRecoveryPoint in the control file is expected to point to the
> + * location immediately following the end of the target
> + * record. However, some tools may record a location immediately after
> + * a WAL segment or page header when the target position is at the
> + * beginning of a segment or page. Normalize such values here before
> + * further processing.
> + */
> + if (XLogSegmentOffset(minRecoveryPoint, wal_segment_size)
> + == SizeOfXLogLongPHD)
> + minRecoveryPoint -= SizeOfXLogLongPHD;
> + else if (minRecoveryPoint % XLOG_BLCKSZ == SizeOfXLogShortPHD)
> + minRecoveryPoint -= SizeOfXLogShortPHD;
> }
> else
> {
I had thought about this before. To update minRecoveryPoint in place, I think we
should make sure that it won't cause any side effects. That means we need to
check every places we use minRecoveryPoint. That's why the v1 patch introduces
GetEffectiveMinRecoveryPoint() rather than updates it in place.
--
Regards,
ChangAo Chen
| From | Date | Subject | |
|---|---|---|---|
| Previous Message | Akshay Joshi | 2026-06-11 07:48:07 | Re: [PATCH] Add pg_get_table_ddl() to reconstruct CREATE TABLE statements |