Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>
Cc:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
Date:	2026-03-24 06:00:20
Message-ID:	CAHGQGwFKULfab1NH1+_-+GdpJ8itUaKGU0_4Uwcr-y0MLZchyQ@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Mar 24, 2026 at 1:01 PM Nisha Moond <nisha(dot)moond412(at)gmail(dot)com> wrote:
> Hi Fujii-san,
>
> I tried reproducing the wait scenario as you mentioned, but could not
> reproduce it.
> Steps I followed:
> 1) Place a debugger in the slotsync worker and hold it at
> fetch_remote_slots() ... -> libpqsrv_get_result()
> 2) Kill the primary.
> 3) Triggered promotion of the standby and release debugger from slotsync worker.
>
> The slot sync worker stops when the promotion is triggered and then
> restarts, but fails to connect to the primary. The promotion happens
> immediately.
> ```
> LOG: received promote request
> LOG: redo done at 0/0301AD40 system usage: CPU: user: 0.00 s, system:
> 0.02 s, elapsed: 4574.89 s
> LOG: last completed transaction was at log time 2026-03-23
> 17:13:15.782313+05:30
> LOG: replication slot synchronization worker will stop because
> promotion is triggered
> LOG: slot sync worker started
> ERROR: synchronization worker "slotsync worker" could not connect to
> the primary server: connection to server at "127.0.0.1", port 9933
> failed: Connection refused
> Is the server running on that host and accepting TCP/IP connections?
> ```
>
> I’ll debug this further to understand it better.
> In the meantime, please let me know if I’m missing any step, or if you
> followed a specific setup/script to reproduce this scenario.

Thanks for testing!

If you killed the primary with a signal like SIGTERM, an RST packet might have
been sent to the slotsync worker at that moment. That would have allowed the
worker to detect the connection loss and exit the wait state, so promotion could
complete as expected.

To reproduce the issue, you'll need a scenario where the worker cannot detect
the connection loss. For example, you could block network traffic (e.g., with
iptables) between the primary and the slotsync worker. The key is to create
a situation where the worker remains stuck waiting for input for a long time.
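
As a rough illustration of the difference, here is a standalone Python sketch
(not PostgreSQL code). The function wait_for_input() and its parameters are
made up for this example, and the timeout exists only so that the demo
terminates. A reader blocked on a socket notices a peer that closes the
connection, but keeps waiting when the peer simply goes silent, which is
roughly what dropped network traffic looks like from the reader's side.

```python
import socket
import threading


def wait_for_input(peer_closes: bool, timeout: float = 1.0) -> str:
    """Report how a blocked reader sees a peer that closes or goes silent."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # ephemeral port, local demo only
    listener.listen(1)
    release = threading.Event()

    def peer() -> None:
        conn, _ = listener.accept()
        if not peer_closes:
            # Silent peer: send nothing and keep the connection open
            # until the reader has given up.
            release.wait()
        # Closing the socket makes the kernel notify the other side.
        conn.close()

    thread = threading.Thread(target=peer)
    thread.start()

    client = socket.create_connection(listener.getsockname())
    client.settimeout(timeout)  # demo-only safety net
    try:
        data = client.recv(1024)
        # An empty read means the peer closed the connection.
        return "detected" if data == b"" else "data"
    except ConnectionResetError:
        return "detected"
    except socket.timeout:
        return "still waiting"
    finally:
        release.set()
        thread.join()
        client.close()
        listener.close()


if __name__ == "__main__":
    print("peer closes:     ", wait_for_input(peer_closes=True))
    print("peer goes silent:", wait_for_input(peer_closes=False))
```

In the silent case the only thing that ends the wait here is the demo timeout,
so a reproduction needs the worker to end up in the equivalent of that second
case.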

Regards,

--
Fujii Masao

In response to

Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion? at 2026-03-24 04:01:28 from Nisha Moond

Responses

Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion? at 2026-03-24 09:15:41 from Fujii Masao
