From: | "Vitaly Davydov" <v(dot)davydov(at)postgrespro(dot)ru> |
---|---|
To: | suyu(dot)cmj <mengjuan(dot)cmj(at)alibaba-inc(dot)com> |
Cc: | "aekorotkov" <aekorotkov(at)gmail(dot)com>, amit(dot)kapila16 <amit(dot)kapila16(at)gmail(dot)com>, "tomas" <tomas(at)vondra(dot)me>, "michael" <michael(at)paquier(dot)xyz>, bharath(dot)rupireddyforpostgres <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Newly created replication slot may be invalidated by checkpoint |
Date: | 2025-09-17 10:49:39 |
Message-ID: | 15922-68ca9280-4f-37de2c40@245457797 |
Lists: | pgsql-hackers |
Hi suyu.cmj
> The commit 2090edc6f32f652a2c introduced a change that the
> minimal restart_lsn is obtained at the start of checkpoint creation. If a
> replication slot is created and performs a WAL reservation concurrently, the
> WAL segment contains the new slot's restart_lsn could be removed by the ongoing
> checkpoint.
Thank you for reporting this issue. I agree: the slot invalidation issue
seems to occur in REL_17_STABLE and earlier, but it is not reproducible in
18+ versions because of a different implementation. The problem may appear if
the first persistent slot is created during a checkpoint, when the slots' oldest lsn
is invalid. I'm not sure how it behaves when other persistent slots exist.
Probably, invalidation is still possible if the reservation happens at an lsn
older than the oldest lsn of the existing slots.
In 17 and earlier versions, when a checkpoint starts, it takes the slots' oldest lsn
using XLogGetReplicationSlotMinimumLSN(). This value is used later for WAL
segment removal. If a new slot reserves WAL between the reading of the slots'
oldest lsn and the WAL removal, it may be invalidated. This happens because
ReplicationSlotReserveWal() checks XLogCtl->lastRemovedSegNo, but the segments
have not been removed yet. There is also a subtle case: the WAL reservation
completes while the checkpointer is between KeepLogSeg and
RemoveOldXlogFiles, where XLogCtl->lastRemovedSegNo is updated. In that case the
slot will not be invalidated, but the segments reserved by the new slot may be
removed, I guess.
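
To make the ordering problem concrete, here is a minimal single-threaded sketch of the race. It is a toy model, not PostgreSQL code: every `toy_*` function, the `ToyWal` struct, the segment numbers, and the simplified removal rule are assumptions made for illustration. Only the overall sequence (read the slots' oldest lsn, reserve WAL against lastRemovedSegNo, remove segments) follows the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;    /* stand-in for XLogRecPtr (assumption) */
typedef uint64_t SegNo;  /* stand-in for XLogSegNo (assumption) */

#define INVALID_LSN ((Lsn) 0)
#define SEG_SIZE    ((Lsn) 16 * 1024 * 1024)

typedef struct ToyWal
{
    SegNo last_removed_segno;  /* models XLogCtl->lastRemovedSegNo */
    Lsn   slot_restart_lsn;    /* restart_lsn of the only slot, or INVALID_LSN */
} ToyWal;

static SegNo
lsn_to_segno(Lsn lsn)
{
    return lsn / SEG_SIZE;
}

/* Checkpointer, step 1: snapshot the slots' oldest lsn. */
static Lsn
toy_get_slot_min_lsn(const ToyWal *wal)
{
    return wal->slot_restart_lsn;
}

/*
 * Slot creation: reserve WAL at 'candidate'. The reservation looks fine
 * as long as the segment is newer than the last removed one.
 */
static bool
toy_reserve_wal(ToyWal *wal, Lsn candidate)
{
    assert(candidate != INVALID_LSN);
    wal->slot_restart_lsn = candidate;
    return lsn_to_segno(candidate) > wal->last_removed_segno;
}

/*
 * Checkpointer, step 2: remove segments older than the cutoff, keeping
 * whatever the (possibly stale) snapshot says the slots still need.
 */
static void
toy_remove_old_segments(ToyWal *wal, Lsn cutoff, Lsn slot_min_snapshot)
{
    Lsn   keep = cutoff;
    SegNo keep_segno;

    if (slot_min_snapshot != INVALID_LSN && slot_min_snapshot < keep)
        keep = slot_min_snapshot;

    keep_segno = lsn_to_segno(keep);
    if (keep_segno > 0 && keep_segno - 1 > wal->last_removed_segno)
        wal->last_removed_segno = keep_segno - 1;
}

/* Is the segment holding the slot's restart_lsn still present? */
static bool
toy_slot_wal_is_available(const ToyWal *wal)
{
    return wal->slot_restart_lsn == INVALID_LSN ||
           lsn_to_segno(wal->slot_restart_lsn) > wal->last_removed_segno;
}

/* Runs one interleaving; returns true if the slot's WAL survives. */
static bool
toy_run_race(bool reserve_before_snapshot)
{
    ToyWal wal = { .last_removed_segno = 0, .slot_restart_lsn = INVALID_LSN };
    Lsn    reserve_at = 5 * SEG_SIZE;  /* hypothetical: slot reserves in segment 5 */
    Lsn    cutoff = 9 * SEG_SIZE;      /* hypothetical: checkpoint keeps from segment 9 */
    Lsn    snapshot;
    bool   reserved = true;

    if (reserve_before_snapshot)
        reserved = toy_reserve_wal(&wal, reserve_at);

    snapshot = toy_get_slot_min_lsn(&wal);

    /* The problematic window: reservation lands after the snapshot. */
    if (!reserve_before_snapshot)
        reserved = toy_reserve_wal(&wal, reserve_at);

    toy_remove_old_segments(&wal, cutoff, snapshot);

    return reserved && toy_slot_wal_is_available(&wal);
}
```

In this model, `toy_run_race(true)` keeps segment 5 because the snapshot already contains the slot's lsn, while `toy_run_race(false)` passes the reservation check and then loses the segment, since the snapshot was still invalid when the removal cutoff was computed.
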
In 17 and earlier we tried to create a compatible solution, in which the oldest lsn
is taken before the slots are synced to disk. In the master branch we added a new
last_saved_restart_lsn field to the ReplicationSlot structure, which seems to be a
better solution.
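
A rough sketch of the idea behind last_saved_restart_lsn, assuming the field holds the restart_lsn value as of the last time the slot was written to disk and that WAL retention is computed from it rather than from the in-memory restart_lsn. The `ToySlot` struct and the `toy_*` helpers are hypothetical and are not the actual ReplicationSlot code.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;  /* stand-in for XLogRecPtr (assumption) */

#define INVALID_LSN ((Lsn) 0)

typedef struct ToySlot
{
    Lsn restart_lsn;             /* in-memory value, may run ahead of disk */
    Lsn last_saved_restart_lsn;  /* value as of the last save to disk */
} ToySlot;

/* Advance the in-memory restart_lsn; the on-disk state is unchanged. */
static void
toy_slot_advance(ToySlot *slot, Lsn lsn)
{
    if (lsn > slot->restart_lsn)
        slot->restart_lsn = lsn;
}

/* Model of writing the slot to disk: remember what was saved. */
static void
toy_slot_save(ToySlot *slot)
{
    slot->last_saved_restart_lsn = slot->restart_lsn;
}

/*
 * Oldest lsn that WAL removal must keep, computed from the saved values,
 * so WAL needed by the on-disk slot state is never removed.
 */
static Lsn
toy_required_lsn(const ToySlot *slots, size_t nslots)
{
    Lsn min = INVALID_LSN;

    for (size_t i = 0; i < nslots; i++)
    {
        Lsn lsn = slots[i].last_saved_restart_lsn;

        if (lsn != INVALID_LSN && (min == INVALID_LSN || lsn < min))
            min = lsn;
    }
    return min;
}
```

With this arrangement, advancing a slot in memory does not change what `toy_required_lsn()` returns until `toy_slot_save()` has run, so the removal cutoff never moves past what the on-disk slot still refers to.
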
I prepared a simple fix [1] for 17 and earlier versions. It seems to fix the
problem with the creation of the first persistent slot. I also think it should
work as it did before the patch that introduced this bug.
I also made some changes to the original test script for versions 17 ([2]) and
18 ([3]).
I will continue to investigate and test it.
[1] 0001-Fix-invalidation-when-slot-is-created-during-checkpo.patch
[2] v2-17-0001-Newly-created-replication-slot-may-be-invalidated-by.patch
[3] v2-18-0001-Newly-created-replication-slot-may-be-invalidated-by.patch
With best regards,
Vitaly
Attachment | Content-Type | Size |
---|---|---|
v2-18-0001-Newly-created-replication-slot-may-be-invalidated-by.patch | text/x-patch | 4.0 KB |
0001-Fix-invalidation-when-slot-is-created-during-checkpo.patch | text/x-patch | 2.0 KB |
v2-17-0001-Newly-created-replication-slot-may-be-invalidated-by.patch | text/x-patch | 4.0 KB |