| From: | Nisha Moond <nisha(dot)moond412(at)gmail(dot)com> |
|---|---|
| To: | shveta malik <shveta(dot)malik(at)gmail(dot)com> |
| Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion? |
| Date: | 2026-03-27 04:57:43 |
| Message-ID: | CABdArM6Mfhk2+9TVR_D3cgfPWPfHDuZEg7MOc5KqULwt0OcQUg@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Fri, Mar 27, 2026 at 9:28 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> On Thu, Mar 26, 2026 at 4:08 PM Nisha Moond <nisha(dot)moond412(at)gmail(dot)com> wrote:
> >
> > On Thu, Mar 26, 2026 at 3:40 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Mon, Mar 23, 2026 at 11:21 AM Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> > > >
> > > > On Sun, Mar 22, 2026 at 1:52 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Wed, Mar 18, 2026 at 9:35 PM Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> > > > > >
> > > > > > I noticed that during standby promotion the startup process sends SIGUSR1 to
> > > > > > the slotsync worker to make it exit. Is there a reason for using SIGUSR1?
> > > > > >
> > > > >
> > > > > IIRC, this same signal is used for both the backend executing
> > > > > pg_sync_replication_slots() and slotsync worker. We want the worker to
> > > > > exit and error_out backend. Using SIGTERM for backend could result in
> > > > > its exit.
> > > >
> > > > Why do we want the backend running pg_sync_replication_slots() to throw
> > > > an error here, rather than just exit? If emitting an error is really required,
> > > > another option would be to store the process type in SlotSyncCtx and send
> > > > different signals accordingly, for example, SIGTERM for the slotsync worker
> > > > and another signal for a backend. But it seems simpler and sufficient to have
> > > > the backend exit in this case as well.
> > > >
> > >
> > > As we want to retain the existing behavior for API, so instead of
> > > using two signals, we can achieve what you intend to achieve by one
> > > signal (SIGUSR1) only. We can use SendProcSignal mechanism as is used
> > > ParallelWorkerShutdown. On promotion, we send a SIGUSR1 signal to
> > > slotsync worker/backend via SendProcSignal. Then in
> > > procsignal_sigusr1_handler(), it will call HandleSlotSyncInterrupt.
> > > HandleSlotSyncInterrupt() will set the InterruptPending and
> > > SlotSyncPending flag. Then ProcessInterrupt() will call a slotsync
> > > specific function based on the flag and do what we currently do in
> > > ProcessSlotSyncInterrupts. I think this should address the issue you
> > > are worried about.
> > >
> >
> > +1
> > Retaining the current behavior for the API backend keeps it consistent
> > with other backends that continue after promotion.
> >
> > In the reproduced case, the worker (or API backend) is waiting in:
> > libpqsrv_get_result -> WaitLatchOrSocket -> WaitEventSetWait.
> > When SIGUSR1 is received, it only sets the latch but does not mark any
> > interrupt as pending. As a result, CHECK_FOR_INTERRUPTS() is
> > effectively a no-op, and the process goes back to waiting. So, control
> > never returns to the slotsync code path, and we cannot rely on
> > stopSignaled to handle exit/error separately.
> > Only SIGTERM works here because its handler sets
> > INTERRUPTS_PENDING_CONDITION, allowing ProcessInterrupts() to run and
> > break the loop. The other signals like SIGUSR1 or SIGINT do not do
> > this, so simply using another signal might not solve the API error
> > handling case.
> >
> > I’ve implemented the above approach suggested by Amit in the attached
> > patch and verified it for both worker and API scenarios. With this,
> > the API can now error-out without exiting the backend.
> >
>
> +1 on the idea. Few comments:
>
Thanks for the review.
> 1)
> It was not clear initially as to why SetLatch is not done in
> HandleSlotSyncShutdownInterrupt(), digging it further revealed that
> procsignal_sigusr1_handler() will do SetLatch outside. Perhaps you can
> add below comment at the end of HandleSlotSyncShutdownInterrupt()
> similar to how other functions (HandleProcSignalBarrierInterrupt,
> HandleRecoveryConflictInterrupt etc) do.
>
> /* latch will be set by procsignal_sigusr1_handler */
>
Fixed.
> 2)
> In ProcessSlotSyncInterrupts(), now we don't need the below logic right?
>
> if (SlotSyncCtx->stopSignaled)
> {
> if (AmLogicalSlotSyncWorkerProcess())
> {
> ...
> proc_exit(0);
> }
> else
> {
> /*
> * For the backend executing SQL function
> * pg_sync_replication_slots().
> */
> ereport(ERROR,
> errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> errmsg("replication slot synchronization will stop
> because promotion is triggered"));
> }
> }
>
Right. Attached patch with the suggested changes.
--
Thanks,
Nisha
| Attachment | Content-Type | Size |
|---|---|---|
| v3-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch | application/octet-stream | 8.1 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Michael Paquier | 2026-03-27 05:17:57 | Re: Refactor query normalization into core query jumbling |
| Previous Message | Sami Imseih | 2026-03-27 04:49:11 | Re: Clean up NamedLWLockTranche stuff |