Re: Timeline switching with partial WAL records can break replica recovery

From: D Laaren <dlaaren8(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Timeline switching with partial WAL records can break replica recovery
Date: 2025-06-17 11:59:14
Message-ID: CAGWv16+hDSNThZeNf0qvUHHpmLE04jurrqHN7BbV1_uSN6tq+w@mail.gmail.com
Lists: pgsql-hackers

I've done more research and found that replicas enter an infinite
loop in the 'XLogPageRead' function.
The loop works as follows:
0. Timeline N contains a partially written record with LSN = targetRecPtr;
1. In 'XLogPageRead' we attempt to read the next page, which has to
contain the rest of the unfinished record;
2. In 'WaitForWALToBecomeAvailable', walrcv is requested to fetch
records starting from LSN = targetRecPtr on timeline N + 1;
3. Walrcv retrieves data up to the end of the page containing the end of
timeline N + 1;
4. Then, in 'WaitForWALToBecomeAvailable', the replica switches to the
XLOG_FROM_ARCHIVE state, and the function returns true;
5. Execution continues in 'XLogPageRead';
6. The page at addr = targetPagePtr is checked for validity, but we
get an 'invalid magic number' error because walrcv hasn't retrieved
this page;
7. Execution jumps to the 'next_record_is_invalid' label;
8. Since we are in standby mode, the process retries from the beginning.
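
To make the cycle easier to follow, here is a toy model of the steps
above in plain C. It is only a sketch and not PostgreSQL code: the
ReplicaModel struct, the helper functions, and the max_attempts cap are
all made up for illustration (the real loop has no such cap), and pages
are reduced to plain numbers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins; these are NOT PostgreSQL's real types. */
typedef enum { FROM_STREAM, FROM_ARCHIVE } WalSource;

typedef struct {
    unsigned long target_page;    /* page that must hold the rest of the record */
    unsigned long streamed_upto;  /* first page the modelled walrcv did NOT deliver */
    WalSource source;             /* where the model will look for WAL next */
} ReplicaModel;

/* Models steps 2-4: ask walrcv for data, receive what exists, report success. */
static bool wait_for_wal(ReplicaModel *r, unsigned long last_streamed_page)
{
    r->source = FROM_STREAM;                      /* step 2: request streaming */
    r->streamed_upto = last_streamed_page + 1;    /* step 3: data ends with that page */
    r->source = FROM_ARCHIVE;                     /* step 4: switch state ... */
    return true;                                  /* ... and still return true */
}

/* Models step 6: a page that was never received fails the header check. */
static bool page_is_valid(const ReplicaModel *r)
{
    return r->target_page < r->streamed_upto;
}

/*
 * Returns the attempt number on which the target page could be read,
 * or -1 if the model is still retrying after max_attempts iterations.
 */
int simulate_page_read(unsigned long target_page,
                       unsigned long last_streamed_page,
                       int max_attempts)
{
    ReplicaModel r = { target_page, 0, FROM_ARCHIVE };

    assert(max_attempts > 0);
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (!wait_for_wal(&r, last_streamed_page))
            return -1;
        if (page_is_valid(&r))
            return attempt;
        /* steps 7-8: "next_record_is_invalid", standby mode, retry from the top */
    }
    return -1;
}
```

With target_page = 5 and last_streamed_page = 4 the model never gets
past the validity check, because every retry receives exactly the same
data; with target_page = 4 it succeeds on the first attempt.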

See the attachments for a more colorful illustration this time =)

From my point of view, the first solution which I described in my
previous message still seems like a good choice.

I've also found the current solution in commit [1]. With all due
respect, it seems to treat the symptom rather than the underlying
issue.

[1]
https://github.com/postgres/postgres/commit/6cf1647d87e7cd423d71525a8759b75c4e4a47ec

Attachment Content-Type Size
how_replicas_enter_indefinite_loop_1.jpg image/jpeg 1.5 MB
how_replicas_enter_indefinite_loop_2.jpg image/jpeg 1.3 MB
