From: | D Laaren <dlaaren8(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Timeline switching with partial WAL records can break replica recovery |
Date: | 2025-06-17 11:59:14 |
Message-ID: | CAGWv16+hDSNThZeNf0qvUHHpmLE04jurrqHN7BbV1_uSN6tq+w@mail.gmail.com |
Lists: | pgsql-hackers |
I've done more research and identified that replicas enter
an indefinite loop in the 'XLogReadPage' function.
The loop works as follows:
0. Timeline N contains a partially written record with LSN = targetRecPtr;
1. In 'XLogReadPage' we attempt to read the next page, which has to
contain the rest of the unfinished record;
2. In 'WaitForWALToBecomeAvailable' walrcv is requested to fetch
records starting from LSN = targetRecPtr on timeline N + 1;
3. Walrcv retrieves data up to the end of the page containing the end
of timeline N + 1;
4. Then, in 'WaitForWALToBecomeAvailable', the replica switches to the
XLOG_FROM_ARCHIVE state, and the function returns true;
5. Execution continues in 'XLogReadPage';
6. The page at addr = targetPagePtr is checked for validity, but we
get an 'invalid magic number' error because walrcv hasn't retrieved
this page;
7. Execution jumps to 'next_record_is_invalid' label;
8. Since we are in standby mode, the process retries from the beginning.
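To make the cycle easier to follow, here is a rough toy model of the steps above. It is only a sketch and not PostgreSQL code: ToyReplica, ToySource, toy_wait_for_wal, toy_page_is_valid and toy_read_page are hypothetical names made up for illustration, and the retry cap exists only so that the model terminates.

```c
#include <stdbool.h>

/* Toy model only: these types and functions are hypothetical and are
 * not taken from the PostgreSQL source tree. */
typedef enum { TOY_FROM_STREAM, TOY_FROM_ARCHIVE } ToySource;

typedef struct
{
    unsigned long target_page;       /* page holding the rest of the record */
    unsigned long streamed_end_page; /* last page walrcv actually retrieved */
    ToySource     source;            /* current WAL source state */
} ToyReplica;

/* Steps 2-4: request streaming, stop where the stream ends, then switch
 * state and report success even though target_page may be missing. */
bool
toy_wait_for_wal(ToyReplica *r)
{
    r->source = TOY_FROM_STREAM;   /* step 2: ask walrcv for targetRecPtr */
    /* step 3: nothing past streamed_end_page arrives */
    r->source = TOY_FROM_ARCHIVE;  /* step 4: switch state, return true */
    return true;
}

/* Step 6: a page passes validation only if it was retrieved. */
bool
toy_page_is_valid(const ToyReplica *r)
{
    return r->target_page <= r->streamed_end_page;
}

/* Steps 1-8 with a retry cap so the model terminates. Returns the
 * attempt on which the page was read, or 0 if every attempt failed. */
int
toy_read_page(ToyReplica *r, int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; attempt++)
    {
        if (!toy_wait_for_wal(r))
            return 0;
        if (toy_page_is_valid(r))
            return attempt;
        /* steps 7-8: invalid page, standby mode, retry from the top */
    }
    return 0;
}
```

With target_page = 5 and streamed_end_page = 4, toy_read_page returns 0 for any cap, because nothing changes between attempts. With streamed_end_page = 5 it returns 1.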
See the attachments for a more colorful illustration this time =)
From my point of view, the first solution which I described in my
previous message still seems like a good choice.
I've also found the current solution in commit [1]. With all due
respect, it seems to treat the symptom rather than the underlying
issue.
[1] https://github.com/postgres/postgres/commit/6cf1647d87e7cd423d71525a8759b75c4e4a47ec
Attachment | Content-Type | Size |
---|---|---|
how_replicas_enter_indefinite_loop_1.jpg | image/jpeg | 1.5 MB |
how_replicas_enter_indefinite_loop_2.jpg | image/jpeg | 1.3 MB |