| From: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
|---|---|
| To: | Nisha Moond <nisha(dot)moond412(at)gmail(dot)com> |
| Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion? |
| Date: | 2026-03-24 09:15:41 |
| Message-ID: | CAHGQGwGETy+7Gv5=6kfYucQUy81SwGQbYr=nftHg7ZeqP07sBA@mail.gmail.com |
| Lists: | pgsql-hackers |

On Tue, Mar 24, 2026 at 3:00 PM Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>
> On Tue, Mar 24, 2026 at 1:01 PM Nisha Moond <nisha(dot)moond412(at)gmail(dot)com> wrote:
> > Hi Fujii-san,
> >
> > I tried reproducing the wait scenario as you mentioned, but could not
> > reproduce it.
> > Steps I followed:
> > 1) Place a debugger in the slotsync worker and hold it at
> > fetch_remote_slots() ... -> libpqsrv_get_result()
> > 2) Kill the primary.
> > 3) Triggered promotion of the standby and release debugger from slotsync worker.
> >
> > The slot sync worker stops when the promotion is triggered and then
> > restarts, but fails to connect to the primary. The promotion happens
> > immediately.
> > ```
> > LOG: received promote request
> > LOG: redo done at 0/0301AD40 system usage: CPU: user: 0.00 s, system:
> > 0.02 s, elapsed: 4574.89 s
> > LOG: last completed transaction was at log time 2026-03-23
> > 17:13:15.782313+05:30
> > LOG: replication slot synchronization worker will stop because
> > promotion is triggered
> > LOG: slot sync worker started
> > ERROR: synchronization worker "slotsync worker" could not connect to
> > the primary server: connection to server at "127.0.0.1", port 9933
> > failed: Connection refused
> > Is the server running on that host and accepting TCP/IP connections?
> > ```
> >
> > I’ll debug this further to understand it better.
> > In the meantime, please let me know if I’m missing any step, or if you
> > followed a specific setup/script to reproduce this scenario.
>
> Thanks for testing!
>
> If you killed the primary with a signal like SIGTERM, an RST packet might have
> been sent to the slotsync worker at that moment. That would have allowed the
> worker to detect the connection loss and exit the wait state, so promotion
> could complete as expected.
>
> To reproduce the issue, you'll need a scenario where the worker cannot detect
> the connection loss. For example, you could block network traffic (e.g., with
> iptables) between the primary and the slotsync worker. The key is to create
> a situation where the worker remains stuck waiting for input for a long time.

Here's one way to reproduce the issue using iptables:
----------------------------------------------------
[Set up slot synchronization environment]
initdb -D data --encoding=UTF8 --locale=C
cat <<EOF >> data/postgresql.conf
wal_level = logical
synchronized_standby_slots = 'physical_slot'
EOF
pg_ctl -D data start
pg_receivewal --create-slot -S physical_slot
pg_recvlogical --create-slot -S logical_slot -P pgoutput --enable-failover -d postgres
psql -c "CREATE PUBLICATION mypub"
pg_basebackup -D sby1 -c fast -R -S physical_slot -d "dbname=postgres" -h 127.0.0.1
cat <<EOF >> sby1/postgresql.conf
port = 5433
sync_replication_slots = on
hot_standby_feedback = on
EOF
pg_ctl -D sby1 start
psql -c "SELECT pg_logical_emit_message(true, 'abc', 'xyz')"

[Block network traffic used by slot synchronization]
su -
iptables -A INPUT -p tcp --sport 5432 -j DROP
iptables -A OUTPUT -p tcp --dport 5432 -j DROP

[Promote the standby]
# wait a few seconds
pg_ctl -D sby1 promote
----------------------------------------------------
In my tests on master, promotion got stuck in this scenario.
With the patch, promotion completed promptly.
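
To make the difference between "stuck" and "prompt" measurable, the promote call can be given a time limit. The following is only a sketch and was not part of the test above: the helper name promote_with_deadline and the 30-second default are made up for illustration. It relies on pg_ctl promote waiting for completion and on its -t timeout option.

```shell
# Hypothetical helper (not from the original test): promote the standby in
# data directory $1 and report whether pg_ctl confirmed the promotion within
# $2 seconds (default 30).
promote_with_deadline() {
    datadir=$1
    limit=${2:-30}
    start=$(date +%s)
    if pg_ctl -D "$datadir" -w -t "$limit" promote; then
        echo "promotion completed after $(( $(date +%s) - start ))s"
        return 0
    fi
    # pg_ctl gave up waiting or failed; a pending promote request is not
    # cancelled by this.
    echo "promotion not confirmed within ${limit}s"
    return 1
}
```

For example, "promote_with_deadline sby1 30" could replace the plain "pg_ctl -D sby1 promote" step. A non-zero return only means pg_ctl did not see the promotion finish in time.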
After testing, you can remove the network block with:
iptables -D INPUT -p tcp --sport 5432 -j DROP
iptables -D OUTPUT -p tcp --dport 5432 -j DROP
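
If the blocking step was run more than once, each -A call appended another copy of the rule, and a single -D removes only one copy. A possible cleanup sketch follows, assuming an iptables version that supports the -C (check) option. The helper name unblock_port is hypothetical, and it must be run as root like the commands above.

```shell
# Hypothetical helper: delete every copy of the two DROP rules for TCP port $1.
# "iptables -C" succeeds only while a matching rule still exists.
unblock_port() {
    port=$1
    while iptables -C INPUT -p tcp --sport "$port" -j DROP 2>/dev/null; do
        iptables -D INPUT -p tcp --sport "$port" -j DROP || return 1
    done
    while iptables -C OUTPUT -p tcp --dport "$port" -j DROP 2>/dev/null; do
        iptables -D OUTPUT -p tcp --dport "$port" -j DROP || return 1
    done
}
```

Calling "unblock_port 5432" when no such rules exist does nothing, so it is safe to run again.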
Regards,
--
Fujii Masao