Re: Improve pg_sync_replication_slots() to wait for primary to advance

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Ajin Cherian <itsajin(at)gmail(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Japin Li <japinli(at)hotmail(dot)com>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Improve pg_sync_replication_slots() to wait for primary to advance
Date: 2025-12-03 04:00:49
Message-ID: CAJpy0uBktLChk4ZGHChfPtdBA2kEmDwS4Q_hbiBS7Wk9WYQRLQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Dec 3, 2025 at 8:51 AM Ajin Cherian <itsajin(at)gmail(dot)com> wrote:
>
> On Tue, Dec 2, 2025 at 8:35 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> > On Tue, Dec 2, 2025 at 1:58 PM Ajin Cherian <itsajin(at)gmail(dot)com> wrote:
> > >
> > >
> > > Attached patch v27 addresses the above comments.
> > >
> >
> > Thanks for the patch. Please find a few comments:
> >
> > 1)
> > + /* The worker pid must not be already assigned in SlotSyncCtx */
> > + Assert(SlotSyncCtx->pid == InvalidPid);
> > +
> >
> > We can mention just 'pid' here instead of 'worker pid'
> >
>
> Changed.
>
> > 2)
> > + /*
> > + * The syscache access in fetch_or_refresh_remote_slots() needs a
> > + * transaction env.
> > + */
> > fetch_or_refresh_remote_slots --> fetch_remote_slots
> >
> > 3)
> > SyncReplicationSlots(WalReceiverConn *wrconn)
> > {
> > +
> >
> > We can get rid of this blank line at the start of the function.
> >
>
> Fixed.
>
> > 4)
> > /*
> > * Shut down the slot sync worker.
> > *
> > - * This function sends signal to shutdown slot sync worker, if required. It
> > + * This function sends signal to shutdown the slot sync process, if
> > required. It
> > * also waits till the slot sync worker has exited or
> > * pg_sync_replication_slots() has finished.
> > */
> > Shall we change comment to something like (rephrase if required):
> >
> > Shut down the slot synchronization.
> > This function wakes up the slot sync process (either worker or backend
> > running SQL function) and sets stopSignaled=true
> > so that worker can exit or SQL function pg_sync_replication_slots()
> > can finish. It also waits till the slot sync worker has exited or
> > pg_sync_replication_slots() has finished.
> >
>
> Changed.
>
> > 5)
> > We should change the comment atop 'SlotSyncCtxStruct' as well to
> > mention that this pid is either the slot sync worker's pid or
> > backend's pid running the SQL function. It is needed by the startup
> > process to wake these up, so that they can stop synchronization on
> > seeing stopSignaled. <please rephrase as needed>
> >
>
> Changed.
>
> > 6)
> > + ereport(LOG,
> > + errmsg("replication slot synchronization worker is shutting down on
> > receiving SIGUSR1"));
> >
> > SIGUSR1 was actually just a wake-up signal. We may change the comment to:
> > replication slot synchronization worker is shutting down as promotion
> > is triggered.
> >
>
> Changed.
>
> > 7)
> > update_synced_slots_inactive_since:
> > /* The slot sync worker or SQL function mustn't be running by now */
> > - Assert((SlotSyncCtx->pid == InvalidPid) && !SlotSyncCtx->syncing);
> > + Assert(!SlotSyncCtx->syncing);
> >
> > Regarding this, I see that 'update_synced_slots_inactive_since' is
> > only called when we are sure that 'syncing' is false. So shouldn't pid
> > be also Invalid by that time? Even if it was backend's pid to start
> > with, but since backend has stopped syncing (finished or error-ed
> > out),
> > pid should be reset to Invalid in such a case. And this Assert need
> > not to be changed.
> >
>
> Fixed.
>
> > 8)
> >
> > + if (sync_process_pid != InvalidPid)
> > + kill(sync_process_pid, SIGUSR1);
> >
> > We can write comments to say wake-up slot sync process.
> >
>
> Added comments.
>
> Attaching patch v28 addressing these comments.
>

Thanks for the patch. A few trivial comments:

1)
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot continue replication slots synchronization"
+ " as standby promotion is triggered"));

slots->slot, since in every error-message we use 'replication slot
synchronization'

2)
+ * The pid is either the slot sync worker's pid or the backend's pid running
+ * the SQL function pg_sync_replication_slots(). It is needed by the startup
+ * process to wake these up, so that they can stop synchronization on seeing
+ * stopSignaled on promotion.
+ * Setting stopSignaled is also used to handle the race condition when the

Can we rephrase slightly to indicate clearly that it is the startup
process which sets 'stopSignaled' during promotion. Suggestion:

The pid is either the slot sync worker’s pid or the backend’s pid running
the SQL function pg_sync_replication_slots(). When the startup process
sets stopSignaled during promotion, it uses this pid to wake the
currently synchronizing process so that the process can immediately stop its
synchronization work upon seeing stopSignaled set to true.
Setting stopSignaled....

thanks
Shveta

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2025-12-03 04:16:38 Re: Refactor how we form HeapTuples for CatalogTuple(Insert|Update)
Previous Message shveta malik 2025-12-03 03:26:01 Re: Skipping schema changes in publication