Re: Timeline switching with partial WAL records can break replica recovery

From: Alyona Vinter <dlaaren8(at)gmail(dot)com>
To: Nataliia <k(dot)natalissa(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Timeline switching with partial WAL records can break replica recovery
Date: 2025-09-10 09:07:35
Message-ID: CAGWv16JqHWZRnWUcTTEMF=0f+zqpboU4t+eKMANeTJObecYPXA@mail.gmail.com
Lists: pgsql-hackers

Hi!

I've noticed an issue with pg_rewind caused by my patches.

Here are some logs that demonstrate the issue:
pg_rewind: Source timeline history:
pg_rewind: 1: 0/00000000 - 0/03002048
pg_rewind: 2: 0/03002048 - 0/00000000
pg_rewind: Target timeline history:
pg_rewind: 1: 0/00000000 - 0/00000000
pg_rewind: servers diverged at WAL location 0/03002048 on timeline 1
pg_rewind: error: could not find previous WAL record at 0/03002048: invalid record length at 0/03002048: expected at least 24, got 0

When a common timeline ends with an overwritten contrecord, the divergence
point may not point to the start of a valid WAL record on the target,
causing errors and making rewind impossible.
To handle this case, I suggest changing where the search for the preceding
checkpoint starts: when the common timeline is unfinished on the target,
start from the last checkpoint on the target rather than from the
divergence point itself. This ensures we always begin from a known-valid
position in WAL.
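
To illustrate the idea, here is a toy model in Python. It is not taken from
the attached patches and does not use pg_rewind's real data structures: the
Record class, the find_checkpoint_before function, and the LSN values are
all hypothetical. WAL is modelled as a dict from record start LSN to a
record carrying a back-link to the previous record. The divergence point
falls where no record starts, so a backward search from it fails at once,
while a search from the last checkpoint succeeds.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical WAL record: only the fields this sketch needs."""
    prev: Optional[int]          # start LSN of the previous record, None for the first
    is_checkpoint: bool = False


def find_checkpoint_before(wal: Dict[int, Record],
                           start_lsn: int,
                           divergence_lsn: int) -> int:
    """Walk back-links from start_lsn to the nearest checkpoint record
    that starts before divergence_lsn."""
    lsn: Optional[int] = start_lsn
    while lsn is not None:
        rec = wal.get(lsn)
        if rec is None:
            # start_lsn (or a back-link) is not the start of a valid record
            raise ValueError(f"no valid record at {lsn:#x}")
        if rec.is_checkpoint and lsn < divergence_lsn:
            return lsn
        lsn = rec.prev
    raise LookupError("no checkpoint precedes the divergence point")


# Toy target WAL: the record at 0x300 extends past 0x350, so 0x350 is
# not the start of any record on the target.
target_wal = {
    0x100: Record(prev=None, is_checkpoint=True),
    0x200: Record(prev=0x100),
    0x300: Record(prev=0x200),
    0x400: Record(prev=0x300, is_checkpoint=True),  # last checkpoint on target
}
divergence = 0x350
last_checkpoint = 0x400

# Starting from the last checkpoint always begins at a valid record.
found = find_checkpoint_before(target_wal, last_checkpoint, divergence)
```

In this model, starting the walk at `divergence` raises an error because no
record begins there, which corresponds to the "invalid record length"
failure in the logs above.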

I'd appreciate any feedback!

Best Regards,
Alyona Vinter

Attachment Content-Type Size
v3-0001-Handle-WAL-timeline-switches-with-incomplete-records.patch text/x-patch 10.0 KB
v3-0002-Removed-assertion-in-walsummarizer.patch text/x-patch 1.2 KB
v3-0003-Handle-rewind-failure-when-a-timeline-ends-with-an-overwritten-contrecord.patch text/x-patch 5.3 KB
