Hi, all,
I'd like to discuss an issue with how the minimal restart_lsn is obtained for WAL segment removal during a checkpoint. The discussion in [1] fixed the unexpected removal of old WAL segments after a checkpoint followed by an immediate restart. Commit 2090edc6f32f652a2c changed the code so that the minimal restart_lsn is obtained at the start of checkpoint creation. However, if a replication slot is created and reserves WAL concurrently, the WAL segment containing the new slot's restart_lsn could be removed by the ongoing checkpoint. The attached patch adds a Perl test that reproduces this scenario.
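To make the timing easier to follow, here is a rough toy model in Python. It is not PostgreSQL code: the function names, the segment numbers, and the LSN at which the new slot reserves WAL are all assumptions chosen for illustration. It only shows how a minimum taken at checkpoint start can miss a slot created afterwards.

```python
# Toy model only: not PostgreSQL code. All names and numbers are hypothetical.
SEG_SIZE = 16 * 1024 * 1024  # default WAL segment size in bytes


def segno(lsn):
    """Segment number that contains the given LSN."""
    return lsn // SEG_SIZE


def min_restart_lsn(slots):
    """Minimal restart_lsn across slots, or None when there are no slots."""
    return min(slots.values()) if slots else None


def removable_segments(segments, redo_lsn, slots_min_lsn):
    """Segments strictly older than the oldest one that must be kept."""
    keep_from = segno(redo_lsn)
    if slots_min_lsn is not None:
        keep_from = min(keep_from, segno(slots_min_lsn))
    return [s for s in segments if s < keep_from]


segments = [1, 2, 3, 4, 5]   # WAL segments currently on disk
slots = {}                   # no slots exist when the checkpoint starts
redo_lsn = 5 * SEG_SIZE      # redo pointer of the running checkpoint

# 1. Checkpoint start: take the minimum (nothing to protect yet).
snapshot = min_restart_lsn(slots)

# 2. Concurrently, a new slot reserves WAL in segment 3 (assumed position).
slots["new_slot"] = 3 * SEG_SIZE + 100

# 3. Checkpoint end: removal is decided from the stale value.
removed_stale = removable_segments(segments, redo_lsn, snapshot)
# For comparison only: removal decided from the slots' current state.
removed_fresh = removable_segments(segments, redo_lsn, min_restart_lsn(slots))

print("removed using value from checkpoint start:", removed_stale)
print("removed using current minimum:", removed_fresh)
```

In this model the segment holding the new slot's restart_lsn is in the first list but not in the second. The second computation is shown only for contrast; it is not meant as a proposed fix, since relying on the current minimum alone is what the earlier discussion moved away from.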
Additionally, while studying InvalidatePossiblyObsoleteSlot(), I noticed a behavioral difference between PG15 (and earlier) and PG16 (and later). In PG15 and earlier, if the slot's restart_lsn advanced above oldestLSN while we were attempting to acquire the slot, the slot would not be marked invalid. Starting in PG16, whether a slot is marked invalid is determined solely by initial_restart_lsn: even if the slot's restart_lsn advances above oldestLSN while we wait, the slot is still marked invalid. The initial_restart_lsn is recorded in order to report the correct invalidation cause (see discussion [2]), but why not decide whether to mark the slot as invalid based on the slot's current restart_lsn? If a slot's restart_lsn has already advanced sufficiently, shouldn't we refrain from invalidating it?
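The difference between the two decision rules can be written down as a small Python sketch. This is a simplification of my reading of the code, not the actual logic of InvalidatePossiblyObsoleteSlot(); the helper names and the LSN values are made up for illustration.

```python
# Simplified illustration only: helper names and LSN values are hypothetical.

def invalidate_by_initial(initial_restart_lsn, current_restart_lsn, oldest_lsn):
    """PG16+ style, as I understand it: decide from the first observed value."""
    return initial_restart_lsn < oldest_lsn


def invalidate_by_current(initial_restart_lsn, current_restart_lsn, oldest_lsn):
    """PG15-and-earlier style: re-check the slot's current restart_lsn."""
    return current_restart_lsn < oldest_lsn


oldest_lsn = 1000
initial_restart_lsn = 900    # behind oldestLSN when first examined
current_restart_lsn = 1200   # the slot advanced while we were waiting for it

print("by initial:", invalidate_by_initial(initial_restart_lsn, current_restart_lsn, oldest_lsn))
print("by current:", invalidate_by_current(initial_restart_lsn, current_restart_lsn, oldest_lsn))
```

With these values the first rule invalidates the slot and the second does not, which is the case I am asking about.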
[1]: https://www.postgresql.org/message-id/flat/1d12d2-67235980-35-19a406a0%4063439497
[2]: https://www.postgresql.org/message-id/ZaTjW2Xh+TQUCOH0@ip-10-97-1-34.eu-west-3.compute.internal
Looking forward to your feedback.
Best Regards,
suyu.cmj