| From: | Nisha Moond <nisha(dot)moond412(at)gmail(dot)com> |
|---|---|
| To: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
| Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion? |
| Date: | 2026-03-24 04:01:28 |
| Message-ID: | CABdArM4a8am4_PYhpse1UwoP2pbh5BzLbTmaePoDMsbFOeJZ-A@mail.gmail.com |
| Lists: | pgsql-hackers |
On Mon, Mar 23, 2026 at 11:21 AM Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>
> On Sun, Mar 22, 2026 at 1:52 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Wed, Mar 18, 2026 at 9:35 PM Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> > >
> > > I noticed that during standby promotion the startup process sends SIGUSR1 to
> > > the slotsync worker to make it exit. Is there a reason for using SIGUSR1?
> > >
> >
> > IIRC, the same signal is used for both the backend executing
> > pg_sync_replication_slots() and the slotsync worker. We want the worker
> > to exit and the backend to error out. Using SIGTERM for the backend
> > could cause it to exit.
>
> Why do we want the backend running pg_sync_replication_slots() to throw
> an error here, rather than just exit? If emitting an error is really required,
> another option would be to store the process type in SlotSyncCtx and send
> different signals accordingly, for example, SIGTERM for the slotsync worker
> and another signal for a backend. But it seems simpler and sufficient to have
> the backend exit in this case as well.
>
>
> > Also, we want the last slotsync cycle to complete before
> > promotion so that subscribers that fail over or switch over
> > to the new primary have a better chance of finding failover slots
> > sync-ready.
>
> I'm not sure how much this behavior helps in failover/switchover scenarios.
> But the main issue is that if a primary crash triggers standby promotion,
> that last slotsync cycle can get stuck waiting for input from the primary,
> which delays promotion. IOW, failover time can become unnecessarily long
> due to the slotsync worker. I'd like to address that problem.
>
Hi Fujii-san,
I tried to reproduce the wait scenario you described, but could not.
Steps I followed:
1) Attach a debugger to the slotsync worker and hold it at
fetch_remote_slots() ... -> libpqsrv_get_result()
2) Kill the primary.
3) Trigger promotion of the standby and release the debugger from the
slotsync worker.
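In case it helps compare setups, the steps above can be sketched roughly as commands. This only builds and prints the command lines; nothing is executed. The function name, the PID, and the data directory paths are placeholders, and using `pg_ctl stop -m immediate` for step 2 is an assumption standing in for "kill the primary", not necessarily the exact command used. Breaking on libpqsrv_get_result by name also assumes a debug build.

```python
import shlex
from typing import List


def build_repro_commands(worker_pid: int, primary_dir: str,
                         standby_dir: str) -> List[List[str]]:
    """Return the commands for steps 1-3, in order, without running them."""
    return [
        # Step 1: attach gdb to the slotsync worker and stop it at
        # libpqsrv_get_result() (assumes debug symbols are available).
        ["gdb", "-p", str(worker_pid),
         "-ex", "break libpqsrv_get_result", "-ex", "continue"],
        # Step 2: kill the primary (assumption: an immediate shutdown
        # stands in for a crash).
        ["pg_ctl", "-D", primary_dir, "stop", "-m", "immediate"],
        # Step 3: trigger promotion of the standby. Releasing the
        # debugger afterwards is done by hand in the gdb session.
        ["pg_ctl", "-D", standby_dir, "promote"],
    ]


if __name__ == "__main__":
    # Placeholder PID and paths, for illustration only.
    for cmd in build_repro_commands(12345, "/path/to/primary",
                                    "/path/to/standby"):
        print(shlex.join(cmd))
```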
The slotsync worker stops when promotion is triggered and then
restarts, but fails to connect to the primary. The promotion itself
happens immediately.
```
LOG: received promote request
LOG: redo done at 0/0301AD40 system usage: CPU: user: 0.00 s, system: 0.02 s, elapsed: 4574.89 s
LOG: last completed transaction was at log time 2026-03-23 17:13:15.782313+05:30
LOG: replication slot synchronization worker will stop because promotion is triggered
LOG: slot sync worker started
ERROR: synchronization worker "slotsync worker" could not connect to the primary server: connection to server at "127.0.0.1", port 9933 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
```
I’ll debug this further to understand it better.
In the meantime, please let me know if I’m missing any step, or if you
followed a specific setup/script to reproduce this scenario.
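As a side note on the alternative mentioned upthread (store the process type in SlotSyncCtx and send a different signal depending on it), the idea could be sketched roughly as below. This is a standalone illustration, not PostgreSQL code: the enum, the function name, and the choice of SIGUSR1 as the "other" signal for the backend are all assumptions made for the example.

```python
import enum
import signal


class SyncProcKind(enum.Enum):
    """Hypothetical tag recording which kind of process is syncing slots."""
    SLOTSYNC_WORKER = "slotsync worker"
    BACKEND = "backend running pg_sync_replication_slots()"


def promotion_signal(kind: SyncProcKind) -> signal.Signals:
    """Pick the signal to send at promotion for the given process kind."""
    if kind is SyncProcKind.SLOTSYNC_WORKER:
        # The worker should simply exit.
        return signal.SIGTERM
    # Assumption: SIGUSR1 stands in for "another signal" that would make
    # the backend raise an error instead of exiting.
    return signal.SIGUSR1
```

The simpler option from the same message, having the backend exit as well, would make the tag unnecessary, since both kinds of process would then receive SIGTERM.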
--
Thanks,
Nisha