| From: | Ajin Cherian <itsajin(at)gmail(dot)com> |
|---|---|
| To: | shveta malik <shveta(dot)malik(at)gmail(dot)com> |
| Cc: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Japin Li <japinli(at)hotmail(dot)com>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Improve pg_sync_replication_slots() to wait for primary to advance |
| Date: | 2025-11-21 03:44:25 |
| Message-ID: | CAFPTHDYgcvZ60eHWUav3VQSeVibivx7A31rp_pFAkMQrW=j=5A@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Wed, Nov 19, 2025 at 5:10 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Nov 12, 2025 at 1:54 PM Ajin Cherian <itsajin(at)gmail(dot)com> wrote:
> >
> > Attaching patch v23 addressing these comments.
> >
>
> Few comments:
> =============
> 1.
> In contrast, automatic synchronization
> via <varname>sync_replication_slots</varname> provides continuous slot
> updates, enabling seamless failover and supporting high availability.
> - Therefore, it is the recommended method for synchronizing slots.
>
> I think a slotsync worker should still be a recommended method. So, we
> shouldn't remove the last line.
>
Changed.
> 2. I think we can unify slotsync_api_reread_config() and
> ProcessSlotSyncAPIInterrupts() with corresponding existing functions
> for slotsync worker. Having separate functions for API and worker to
> handle interrupts looks odd and bug-prone w.r.t future changes in this
> area.
>
I’ve refactored and unified slotsync_api_reread_config() and
ProcessSlotSyncAPIInterrupts() with their existing counterparts. As
part of this, I switched the shutdown signal for slot-syncing workers
and backends from SIGINT to SIGUSR1, so that all slot-synchronization
processes use a consistent signaling model. Background workers handle
SIGINT via StatementCancelHandler, which is not suitable for
coordinated shutdowns, but SIGUSR1 reliably sets the process latch and
wakes up both worker types. Once awakened, these processes invoke
ProcessSlotSyncInterrupts(), which now checks
SlotSyncCtx->stopSignaled to perform a clean shutdown with appropriate
logs.
> 3.
> +/*
> + * Helper function to extract slot names from a list of remote slots
> + */
> +static List *
> +extract_slot_names(List *remote_slots)
> +{
> + List *slot_names = NIL;
> + MemoryContext oldcontext;
> +
> + /* Switch to long-lived TopMemoryContext to store slot names */
> + oldcontext = MemoryContextSwitchTo(TopMemoryContext);
> +
> + foreach_ptr(RemoteSlot, remote_slot, remote_slots)
>
> Why did we allocate this memory in TopMemoryContext here? We only need
> till this function executes, isn't CurrentMemoryContext (which I think
> is ExprContext) sufficient, if not, why? If we use some
> function/query-level context then we don't need to make it part of
> SlotSyncApiFailureParams. If we can get rid of slot_names from
> SlotSyncApiFailureParams then we probably don't need struct
> SlotSyncApiFailureParams.
>
In further testing, I've found that a Transaction is always started
when pg_sync_replication_slots() is called, and there was no need to
start new transactions and there was no need for a seperate memory
context for remote_slots. I think the confusion was probably from an
earlier bug in my code. I have added an assert to make sure that
transactions are started when in the API (should I make it an error
instead?).
On Wed, Nov 19, 2025 at 6:05 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> On Wed, Nov 12, 2025 at 1:54 PM Ajin Cherian <itsajin(at)gmail(dot)com> wrote:
> >
> >
> > Attaching patch v23 addressing these comments.
> >
>
> Thanks for the patch.
>
> I observed that if the API is taking a nap in between slot sync cycles
> and a promotion is triggered during that time, the promotion has to
> wait for the entire nap period to finish before slot-sync stops and
> the process can continue. There should be a mechanism to wake up the
> backend so the API can exit early once stopSignaled is set. How about
> doing SetLatch for the process doing synchronization in
> ShutDownSlotSync()?
>
Yes. To address this, I now also store the background worker pid which
is calling pg_sync_replication_slots() in SlotSyncCtx->pid, and I've
modified the ShutDownSlotSync logic to issue a SIGUSR1 which will
SetLatch on SlotSyncCtx->pid.
Attaching patch v24, addressing the above comments.
regards,
Ajin Cherian
Fujitsu Australia
| Attachment | Content-Type | Size |
|---|---|---|
| v24-0001-Improve-initial-slot-synchronization-in-pg_sync_.patch | application/octet-stream | 30.3 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Zhijie Hou (Fujitsu) | 2025-11-21 03:47:30 | RE: Assertion failure in SnapBuildInitialSnapshot() |
| Previous Message | Chao Li | 2025-11-21 03:36:23 | Re: Extended Statistics set/restore/clear functions. |