| From: | shveta malik <shveta(dot)malik(at)gmail(dot)com> |
|---|---|
| To: | Nisha Moond <nisha(dot)moond412(at)gmail(dot)com> |
| Cc: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com> |
| Subject: | Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion? |
| Date: | 2026-03-31 10:42:00 |
| Message-ID: | CAJpy0uDaR4LKy2T=vLWuCnY8nQ=m7Zde_sr44aoYr7T0jodV2Q@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Tue, Mar 31, 2026 at 11:35 AM Nisha Moond <nisha(dot)moond412(at)gmail(dot)com> wrote:
>
> On Mon, Mar 30, 2026 at 4:39 PM Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> >
> > On Mon, Mar 30, 2026 at 1:18 PM Nisha Moond <nisha(dot)moond412(at)gmail(dot)com> wrote:
> > > We were using the same log message in two places:
> > > check_and_set_sync_info() and HandleSlotSyncMessage().
> > > I think “will not start” fits better in the first case, while “will
> > > stop” makes sense to keep in the second.
> >
> > Thanks for updating the patch!
> >
> > With the patch, in my testing, standby promotion always produces
> > the following logs:
> >
> > LOG: replication slot synchronization worker will stop because
> > promotion is triggered
> > LOG: replication slot synchronization worker will not start
> > because promotion was triggered
> >
> > It looks like the postmaster immediately restarts the slotsync worker after
> > promotion terminates it, and that new worker then exits on seeing
> > SlotSyncCtx->stopSignaled.
> >
> > IMO, always emitting both messages is a bit confusing. It would be nice to
> > suppress the second one if possible.
> >
> > One idea would be to prevent the restart altogether. For example,
> > ProcessSlotSyncMessage() could set SlotSyncCtx->last_start_time to
> > a special value (like -1), and SlotSyncWorkerCanRestart() could return
> > false (i.e., prevent postmater from starting up slotsync worker) when
> > it sees that. Alternatively, SlotSyncWorkerCanRestart() could simply
> > check SlotSyncCtx->stopSignaled.
> >
> > That said, as far as I remember correctly, postmaster is generally not
> > supposed to touch shared memory (per the comments in postmaster.c),
> > so I'm not sure this approach is acceptable. On the other hand,
> > postmaster and the slotsync worker already rely on SlotSyncCtx->last_start_time,
> > so perhaps there's some precedent here.
> >
> IIUC, checking SlotSyncCtx->stopSignaled in SlotSyncWorkerCanRestart()
> may not be ideal, as it requires a spinlock to avoid races with the
> startup process and it is disallowed to take lock in postmaster main
> loop. Whereas, SlotSyncCtx->last_start_time doesn’t need a lock since
> the postmaster accesses it only when the worker is not alive.
>
I agree.
> Another option could be to log in check_and_set_sync_info() at DEBUG1
> instead of LOG level. This message appears only after stopSignaled is
> set, when promotion is already in progress and the first worker has
> logged “will stop…”. The second worker doesn’t do any real work. Since
> there’s nothing actionable for users, using DEBUG1 would keep it
> useful for debugging (e.g., noticing immediate restarts) while
> avoiding extra log noise. Thoughts?
>
+1.
Do you think we can slightly tweak the comment in atop file to:
On promotion the startup process sets 'stopSignaled' and uses this
'pid' to signal synchronizing process with PROCSIG_SLOTSYNC_MESSAGE
and also to wake it up so that the process can immediately stop its
synchronizing work. Setting 'stopSignaled' on the other hand is used
to handle the race condition....
Also shall we quick exit ProcessSlotSyncMessage() if syncing is
already finished by API?
thanks
Shveta
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Bertrand Drouvot | 2026-03-31 10:45:50 | Re: relfilenode statistics |
| Previous Message | Zhijie Hou (Fujitsu) | 2026-03-31 10:26:33 | RE: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion? |