Quick Links

Re: Improve pg_sync_replication_slots() to wait for primary to advance

From:	shveta malik <shveta(dot)malik(at)gmail(dot)com>
To:	Ajin Cherian <itsajin(at)gmail(dot)com>
Cc:	Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject:	Re: Improve pg_sync_replication_slots() to wait for primary to advance
Date:	2025-08-29 05:01:39
Message-ID:	CAJpy0uASzojKbzinpNu29xuYGsSRnSo=22CLhXaSt_43TVoBhQ@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Tue, Aug 26, 2025 at 9:58 AM Ajin Cherian <itsajin(at)gmail(dot)com> wrote:
>
> On Fri, Aug 22, 2025 at 3:44 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> > On Wed, Aug 20, 2025 at 10:53 AM Ajin Cherian <itsajin(at)gmail(dot)com> wrote:
> > >
> > >
> > > I've removed them.
> > > Attaching patch v8 addressing the above comments.
> > >
> >
> > Thanks for the patch. Please find a few comments:
> >
> > 1)
> > When the API is in progress, and meanwhile in another session we turn
> > off hot_standby_feedback, the API session terminates abnormally.
> >
> > postgres=# SELECT pg_sync_replication_slots();
> > server closed the connection unexpectedly
> >
> > It seems slotsync_reread_config() is not adjusted for API. It does
> > proc_exit assuming it is a worker process.
> >
>
> I've removed the API calling ProcessSlotSyncInterrupts() as I don't
> think the API needs to specifically handle shutdown requests, it works
> without calling this. And the other config checks, I don't think the
> API needs to handle, I think we should leave it to the user.
>
> > 2)
> > slotsync_worker_onexit():
> >
> > SlotSyncCtx->slot_persistence_pending = false;
> >
> > /*
> > * If syncing_slots is true, it indicates that the process errored out
> > * without resetting the flag. So, we need to clean up shared memory and
> > * reset the flag here.
> > */
> > if (syncing_slots)
> > {
> > SlotSyncCtx->syncing = false;
> > syncing_slots = false;
> > }
> >
> > Shall we reset slot_persistence_pending inside 'if (syncing_slots)'?
> > slot_persistence_pending can not be true without syncing_slots being
> > true.
> >
> > 3)
> > reset_syncing_flag():
> >
> > SpinLockAcquire(&SlotSyncCtx->mutex);
> > SlotSyncCtx->syncing = false;
> > + SlotSyncCtx->slot_persistence_pending = false;
> > SpinLockRelease(&SlotSyncCtx->mutex);
> >
> > Here we are changing slot_persistence_pending under mutex, while at
> > other places, it is not protected by mutex. Is it intentional here?
> >
> > 4)
> > On rethinking, we maintain anything in shared memory if it has to be
> > shared between a few processes. 'slot_persistence_pending' OTOH is
> > required to be set and accessed by only one process at a time. Shall
> > we move it out of SlotSyncCtxStruct and keep it static similar to
> > 'syncing_slots'? Rest of the setting, resetting flow remains the same.
> > What do you think?
> >
>
> Yes, I agree. I have modified it accordingly.
>
> >
> > 5)
> > /* Build the query based on whether we're fetching all or refreshing
> > specific slots */
> >
> > Perhaps we can shift this comment to where we actually append
> > target_slot_list. Better to have it something like:
> > 'If target_slot_list is provided, construct the query only to fetch given slots'
> >
>
> Changed.
>
> Attaching patch v9 addressing the above comments.
>

Thank You for the patches. Please find a few comments.

1)
Change not needed:

* without waiting for SLOTSYNC_RESTART_INTERVAL_SEC.
+ *
*/

2)
Regarding the naptime in API, I was thinking why not to use
wait_for_slot_activity() directly? If there are no slots activity, it
will keep on doubling the naptime starting from 2sec till max of
30sec. Thoughts?

We can rename MIN_SLOTSYNC_WORKER_NAPTIME_MS and
MAX_SLOTSYNC_WORKER_NAPTIME_MS to MIN_SLOTSYNC_NAPTIME_MS and
MAX_SLOTSYNC_NAPTIME_MS in such a case.

3)
+ * target_slot_list - List of RemoteSlot structures to refresh, or NIL to
+ * fetch all failover slots

Can we please change it to:

List of failover logical slots to fetch from primary, or NIL to fetch
all failover logical slots

4)
In the worker, before each call to synchronize_slots(), we are
starting a new transaction. It aligns with the previous implementation
where StartTransaction was inside synchronize_slots(). But in API, we
are doing StartTransaction once outside of the loop instead of doing
before each synchronize_slots(), is it intentional? It may keep the
transaction open for a long duration for the case where slots are not
getting persisted soon.

5)
With ProcessSlotSyncInterrupts() being removed from API, can you
please check the behaviour of API on smart-shutdown and rest of the
shutdown modes? It should behave like other APIs. And what happens if
I change primary_conninfo to some non-existing server when the API is
running. Does it error out or keep retrying?

thanks
Shveta

In response to

Re: Improve pg_sync_replication_slots() to wait for primary to advance at 2025-08-26 04:28:30 from Ajin Cherian

Responses

Re: Improve pg_sync_replication_slots() to wait for primary to advance at 2025-08-29 06:12:39 from Ajin Cherian

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Peter Eisentraut	2025-08-29 05:23:27	Re: Remove unneeded cast in heap_xlog_lock.
Previous Message	Masahiko Sawada	2025-08-29 04:28:59	Re: POC: enable logical decoding when wal_level = 'replica' without a server restart