Newly created replication slot may be invalidated by checkpoint

From: "suyu(dot)cmj" <mengjuan(dot)cmj(at)alibaba-inc(dot)com>
To: "aekorotkov" <aekorotkov(at)gmail(dot)com>, "amit(dot)kapila16" <amit(dot)kapila16(at)gmail(dot)com>, "tomas" <tomas(at)vondra(dot)me>, "v(dot)davydov" <v(dot)davydov(at)postgrespro(dot)ru>, "michael" <michael(at)paquier(dot)xyz>, "bharath(dot)rupireddyforpostgres" <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Newly created replication slot may be invalidated by checkpoint
Date: 2025-09-15 14:41:41
Message-ID: 5e045179-236f-4f8f-84f1-0f2566ba784c.mengjuan.cmj@alibaba-inc.com
Lists: pgsql-hackers

Hi, all,
I'd like to discuss an issue with how the minimal restart_lsn is obtained for WAL segment removal during a checkpoint. The discussion [1] fixed the unexpected removal of old WAL segments after a checkpoint followed by an immediate restart. Commit 2090edc6f32f652a2c changed the code so that the minimal restart_lsn is obtained at the start of checkpoint creation. If a replication slot is created and reserves WAL concurrently, the WAL segment containing the new slot's restart_lsn can be removed by the ongoing checkpoint. The attached patch adds a Perl test that reproduces this scenario.
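To make the interleaving concrete, here is a toy model of the race in Python. It is only a sketch under stated assumptions, not PostgreSQL code: LSNs are plain integers, and Slot, min_restart_lsn and toy_checkpoint are hypothetical names invented for the illustration. The only point it shows is that a minimum computed at checkpoint start does not see a slot that reserves WAL afterwards.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Slot:
    restart_lsn: int           # toy LSN, a plain integer (assumption)
    invalidated: bool = False


def min_restart_lsn(slots: Dict[str, Slot]) -> Optional[int]:
    """Smallest restart_lsn over all non-invalidated slots, or None."""
    live = [s.restart_lsn for s in slots.values() if not s.invalidated]
    return min(live) if live else None


def toy_checkpoint(
    slots: Dict[str, Slot],
    redo_lsn: int,
    concurrent_create: Optional[Tuple[str, int]] = None,
) -> int:
    """Hypothetical, simplified checkpoint; returns the removal cutoff."""
    # Step 1: the minimum is taken once, at the start of the checkpoint.
    snapshot = min_restart_lsn(slots)

    # Step 2: a new slot reserves WAL while the checkpoint is running.
    if concurrent_create is not None:
        name, lsn = concurrent_create
        slots[name] = Slot(restart_lsn=lsn)

    # Step 3: the cutoff is derived from the stale snapshot, so it does
    # not account for the slot created in step 2.
    cutoff = redo_lsn if snapshot is None else min(redo_lsn, snapshot)

    # Step 4: everything below the cutoff is removed, and any slot that
    # still needs removed WAL is invalidated.
    for slot in slots.values():
        if slot.restart_lsn < cutoff:
            slot.invalidated = True
    return cutoff


# A slot that already exists at checkpoint start holds the cutoff back.
before = {"old": Slot(500), "new": Slot(300)}
toy_checkpoint(before, redo_lsn=400)

# The same slot created during the checkpoint is missed by the snapshot.
during = {"old": Slot(500)}
toy_checkpoint(during, redo_lsn=400, concurrent_create=("new", 300))
```

In the first call the slot named "new" survives, because the snapshot includes it; in the second call it ends up invalidated, even though both runs use the same made-up LSN values.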
Additionally, while studying InvalidatePossiblyObsoleteSlot(), I noticed a behavioral difference between PG15 (and earlier) and PG16 (and later). In PG15 and earlier, if the slot's restart_lsn advanced to a value greater than oldestLSN while we were attempting to acquire the slot, the slot would not be marked invalid. Starting in PG16, whether a slot is marked invalid is determined solely by initial_restart_lsn: even if the slot's restart_lsn advances above oldestLSN while we are waiting, the slot is still marked invalid. The initial_restart_lsn is recorded in order to report the correct invalidation cause (see discussion [2]), but why not decide whether to invalidate the slot based on its current restart_lsn? If a slot's restart_lsn has already advanced sufficiently, shouldn't we refrain from invalidating it?
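The two decision rules being compared can be sketched as follows. This is a simplified illustration in Python, not the logic of InvalidatePossiblyObsoleteSlot() itself, which checks more conditions; the function names invalidate_by_initial and invalidate_by_current and the integer LSN values are assumptions made for the example.

```python
def invalidate_by_initial(initial_restart_lsn: int,
                          current_restart_lsn: int,
                          oldest_lsn: int) -> bool:
    """Rule as described for PG16 and later: only the restart_lsn seen
    on the first pass matters; later advancement is ignored."""
    return initial_restart_lsn < oldest_lsn


def invalidate_by_current(initial_restart_lsn: int,
                          current_restart_lsn: int,
                          oldest_lsn: int) -> bool:
    """Rule as described for PG15 and earlier: re-check the slot's
    restart_lsn at the time the slot is finally acquired."""
    return current_restart_lsn < oldest_lsn


# Made-up values: the slot was behind oldest_lsn when first examined,
# but advanced past it while the checkpointer waited to acquire it.
initial, current, oldest = 100, 250, 200

initial_rule = invalidate_by_initial(initial, current, oldest)
current_rule = invalidate_by_current(initial, current, oldest)
```

With these values the first rule still invalidates the slot while the second one spares it; the two rules agree whenever the slot has not advanced past oldest_lsn in the meantime.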
[1]: https://www.postgresql.org/message-id/flat/1d12d2-67235980-35-19a406a0%4063439497
[2]: https://www.postgresql.org/message-id/ZaTjW2Xh+TQUCOH0@ip-10-97-1-34.eu-west-3.compute.internal
Looking forward to your feedback.
Best Regards,
suyu.cmj

Attachment Content-Type Size
0001-Newly-created-replication-slot-may-be-invalidated.patch application/octet-stream 5.0 KB
