| From: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com> |
|---|---|
| To: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
| Cc: | vignesh C <vignesh21(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, suyu(dot)cmj <mengjuan(dot)cmj(at)alibaba-inc(dot)com>, tomas <tomas(at)vondra(dot)me>, michael <michael(at)paquier(dot)xyz>, bharath(dot)rupireddyforpostgres <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
| Subject: | RE: Newly created replication slot may be invalidated by checkpoint |
| Date: | 2026-01-21 14:46:26 |
| Message-ID: | TY7PR01MB14554DBE84290130EB421DD28F596A@TY7PR01MB14554.jpnprd01.prod.outlook.com |
| Lists: | pgsql-hackers |
Dear Hou,
I reproduced the issue using the steps in [1] and confirmed that your patch
resolves it. Here are my comments.
1.
The replication slot is not invalidated even when only 0002 is applied. Can you
modify the workload so that invalidation occurs without the fix? Or is that impossible?
2.
It might be a matter of taste, but I do not like assigning to the argument;
it should basically be treated as immutable. How about the attached? It makes
clear that the minimum safe LSN would be used for restart_lsn.
[1] Reproducer:
0.
As preparation, I made sure the standby node had removed at least one WAL
segment. This avoids reading the oldest segment from the pg_wal directory.
In my environment, I executed the SQL below on the primary, then ran the
CHECKPOINT command on the standby.
```
INSERT INTO foo VALUES (1);
SELECT pg_switch_wal();
INSERT INTO foo VALUES (1);
SELECT pg_switch_wal();
SELECT pg_switch_wal();
SELECT * FROM pg_replication_slot_advance('failover_slot', pg_current_wal_lsn());
SELECT * FROM pg_replication_slot_advance('failover_slot', pg_current_wal_lsn());
SELECT * FROM pg_replication_slot_advance('failover_slot', pg_current_wal_lsn());
...
CHECKPOINT ;
```
1.
Advanced the WAL segments on the primary, then ran CHECKPOINT. I executed the following:
```
INSERT INTO foo VALUES (1);
SELECT pg_switch_wal();
INSERT INTO foo VALUES (1);
SELECT pg_switch_wal();
SELECT pg_switch_wal();
CHECKPOINT ;
```
2.
Connected to the standby and set a breakpoint at reserve_wal_for_local_slot().
3.
Ran pg_sync_replication_slots() on the standby. It stopped at the breakpoint.
4.
Attached to the checkpointer on the standby and set a breakpoint at KeepLogSeg().
5.
Ran the CHECKPOINT command on the standby. It triggered a restartpoint and
stopped at KeepLogSeg().
6.
Resumed the checkpointer until the end of KeepLogSeg(). The removal target
was determined here.
7.
Resumed the backend until ReplicationSlotRelease() in synchronize_one_slot().
8.
Resumed the checkpointer. It tried to terminate the backend in order to
invalidate the replication slot.
```
LOG: terminating process 3152756 to release replication slot "failover_slot"
DETAIL: The slot's restart_lsn 0/05000028 exceeds the limit by 33554392 bytes.
HINT: You might need to increase "max_slot_wal_keep_size".
```
Best regards,
Hayato Kuroda
FUJITSU LIMITED
| Attachment | Content-Type | Size |
|---|---|---|
| kuroda.diffs | application/octet-stream | 2.5 KB |