pg_rewind WAL segments deletion pitfall

From: Полина Бунгина <bungina(at)gmail(dot)com>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: pg_rewind WAL segments deletion pitfall
Date: 2022-08-23 15:46:30
Message-ID: CAAtGL4AhzmBRsEsaDdz7065T+k+BscNadfTqP1NcPmsqwA5HBw@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

Hello,

It seems to me that there is currently a pitfall in the pg_rewind
implementation.

Imagine the following situation:

There is a cluster consisting of a replica and a primary configured with
wal_level='replica' and archive_mode='on'.

1. The primary is not fast enough in archiving WAL segments (e.g. because
of network issues or high CPU/disk load)
2. The primary fails
3. The replica is promoted
4. We are unlucky: the timelines of the new and the old primary have
diverged, so we need to run pg_rewind
5. We are even less lucky: the old primary still has some WAL segments
with .ready signal files that were generated before the point of divergence
and were not archived (e.g. 000000020004D20200000095.done,
000000020004D20200000096.ready, 000000020004D20200000097.ready,
000000020004D20200000098.ready)
6. The promoted primary runs for some time and recycles the old WAL
segments.
7. We revive the old primary and try to rewind it
8. When pg_rewind finishes successfully, we see that the WAL segments
with .ready files have been removed, because they were already absent on
the promoted replica. We end up completely losing some WAL segments, even
though we had a clear sign that they were not archived and, more
importantly, even though pg_rewind read these segments while collecting
information about the data blocks.
9. The old primary fails to start because of the missing WAL segments
(more strictly, the records between the last common checkpoint and the
point of divergence) with the following log record: "ERROR: requested WAL
segment 000000020004D20200000096 has already been removed"
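
To make step 5 concrete, here is a minimal sketch of how one could list the
segments at risk before running pg_rewind. It relies only on the fact that
the archiver keeps <segment>.ready and <segment>.done marker files in
pg_wal/archive_status. The function name unarchived_segments is
hypothetical and is not part of pg_rewind or of the attached patch.

```python
import os


def unarchived_segments(archive_status_dir):
    """Return the WAL file names that still carry a .ready marker.

    archive_status_dir is expected to point at <PGDATA>/pg_wal/archive_status.
    A .ready marker means the archiver has not yet archived that file.
    """
    suffix = ".ready"
    names = [
        entry[: -len(suffix)]
        for entry in os.listdir(archive_status_dir)
        if entry.endswith(suffix)
    ]
    # WAL file names sort chronologically within a timeline.
    return sorted(names)
```

Run against the old primary from the scenario above, this would report
segments ...96, ...97 and ...98, which are exactly the ones that go missing
after the rewind.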

In this situation, after pg_rewind, the following segments are archived:

000000020004D20200000095
000000020004D20200000099.partial
000000030004D20200000099

and the following segments are lost:

000000020004D20200000096
000000020004D20200000097
000000020004D20200000098

Thus, my question is: why can’t pg_rewind be a little bit wiser when
building the filemap for WAL files? Can it preserve, on the target, the WAL
segments that contain those potentially lost records (after the last common
checkpoint and before the point of divergence)? (see the patch attached)
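
As an illustration of the idea only (this is not the code of the attached
patch, which changes pg_rewind itself in C), the set of segments to
preserve can be derived from the two LSNs. The sketch assumes the default
16 MB WAL segment size and the standard WAL file naming scheme (8 hex
digits each for the timeline, the high part of the segment number and the
low part). The function names and the LSN values in the usage note are
hypothetical.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default segment size


def parse_lsn(text):
    """Convert an LSN written as 'XXXXXXXX/YYYYYYYY' into an integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def segment_name(timeline, segno, seg_size=WAL_SEGMENT_SIZE):
    """Build the 24-character WAL file name for a segment number."""
    segs_per_xlogid = 0x100000000 // seg_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segs_per_xlogid,
        segno % segs_per_xlogid,
    )


def segments_to_preserve(timeline, checkpoint_lsn, divergence_lsn,
                         seg_size=WAL_SEGMENT_SIZE):
    """List the target's WAL files covering checkpoint..divergence.

    The segment containing the divergence point is included as well,
    which is harmless if the divergence falls on a segment boundary.
    """
    first = parse_lsn(checkpoint_lsn) // seg_size
    last = parse_lsn(divergence_lsn) // seg_size
    return [segment_name(timeline, s, seg_size)
            for s in range(first, last + 1)]
```

For example, with a last common checkpoint at the made-up LSN
4D202/96000028 and a divergence point at 4D202/99000100 on timeline 2, the
call segments_to_preserve(2, "4D202/96000028", "4D202/99000100") yields
segments ...96 through ...99. A filemap built this way would leave those
files in place on the target instead of removing them.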

If I am missing something, however, please correct me or explain why this
straightforward solution cannot be implemented.

Thank you,

Polina Bungina

Attachment: v1-0001-pg_rewind-wal-deletion.patch (application/octet-stream, 5.4 KB)