Re: Improve pg_sync_replication_slots() to wait for primary to advance

From: Ajin Cherian <itsajin(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>, Japin Li <japinli(at)hotmail(dot)com>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improve pg_sync_replication_slots() to wait for primary to advance
Date: 2025-11-12 08:24:18
Message-ID: CAFPTHDYoxLaz3k6EPd9kf-sP3piRNR1v4ovxwmZ1r6copO=MJw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Nov 10, 2025 at 8:31 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> On Fri, Nov 7, 2025 at 10:36 AM Japin Li <japinli(at)hotmail(dot)com> wrote:
> >
> > >
> > > Attaching patch v22 addressing the above comments.
> >
> > @@ -62,8 +62,8 @@ LOGICAL_APPLY_MAIN "Waiting in main loop of logical replication apply process."
> > LOGICAL_LAUNCHER_MAIN "Waiting in main loop of logical replication launcher process."
> > LOGICAL_PARALLEL_APPLY_MAIN "Waiting in main loop of logical replication parallel apply process."
> > RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive, during streaming recovery."
> > -REPLICATION_SLOTSYNC_MAIN "Waiting in main loop of slot sync worker."
> > REPLICATION_SLOTSYNC_SHUTDOWN "Waiting for slot sync worker to shut down."
> > +REPLICATION_SLOTSYNC_MAIN "Waiting in main loop of slot synchronization."
> > SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
> > WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
> > WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
> >
> > I've noticed that all events are sorted alphabetically. I think we should keep
> > the order of REPLICATION_SLOTSYNC_MAIN unchanged.
> >
>
> +1.
>

Yes, changed.

> Few trivial comments:
>
> 1)
> Since we have always used the term 'SQL function' rather than API in
> existing code, shall we change all references of API to 'SQL function'
> in current patch:
>
> + * If the pg_sync_replication API is used to sync the slots, and if the slots
> "If the SQL function pg_sync_replication_slots() is used.."
>
> + * the reasons mentioned above, then the API also waits and retries until the
> API --> SQL function
>
> + * persist. It is utilized by the pg_sync_replication_slots() API.
> pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots()
>
> + * the API can retry.
> API --> SQL function
>
> + /* Set this, so that API can retry */
> API --> SQL function
>
> + * persist. It is utilized by the pg_sync_replication_slots() API.
> pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots()
>
> + * slot_persistence_pending - boolean used by pg_sync_replication_slots
> + * API to track if any slots could not be
> pg_sync_replication_slots API --> SQL function pg_sync_replication_slots()
>
> + * Interrupt handler for pg_sync_replication_slots() API.
> pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots()
>
>
> 2)
> ProcessSlotSyncAPIInterrupts
> slotsync_api_reread_config
> -- These also have API in it, but I do not have any better name
> suggestions here, we can retain the current ones and see what others
> say.
>

Changed.

> 3)
> /*
> * Re-read the config file.
> *
> * Exit if any of the slot sync GUCs have changed. The postmaster will
> * restart it.
> */
> static void
> slotsync_reread_config(void)
>
> Shall we change this existing comment to: Re-read the config file for
> slot sync worker.
>

> 4)
>
> +/*
> + * Re-read the config file and check for critical parameter changes.
> + *
> + */
> +static void
> +slotsync_api_reread_config(void)
>
> Shall we change comment to:
> /*
> * Re-read the config file for SQL function pg_sync_replication_slots()
> *
> * Emit error if any of the slot sync GUCs have changed.
> */
>
Changed.

On Mon, Nov 10, 2025 at 9:44 PM Ashutosh Bapat
<ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
>
> On Mon, Nov 10, 2025 at 3:01 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> >
> >
> > 2)
> > ProcessSlotSyncAPIInterrupts
> > slotsync_api_reread_config
> > -- These also have API in it, but I do not have any better name
> > suggestions here, we can retain the current ones and see what others
> > say.
>
> ProcessSlotSyncInterrupts() handles shutdown waiting,
> ProcessSlotSyncAPIInterrupts doesn't. Why is this difference? It will
> be good to explain why we need two different functions for worker and
> SQL function and also explain the difference between them.
>

I've updated the function header to explain this. The slot sync worker
is a specific background worker while the API runs in the regular
backend, so special handling is not needed.

> $primary->psql('postgres', "SELECT pg_drop_replication_slot('lsub2_slot');");
> +$primary->psql('postgres', "SELECT pg_drop_replication_slot('lsub1_slot');");
>
> I think, the intention behind dropping the slot is to be able to
> create it again for the next test. But there is no comment here
> explaining that. There is also no comment explaining why we are
> dropping both slots here; when the test only needs dropping one.
> That's going to create confusion. One might think that all the slots
> need to be dropped at this stage, and drop and create any future slots
> that are used by prior code, for example. At the end of this test, we
> recreate the slot using pg_create_logical_replication_slot(), which is
> different method of creating slot than this test does. Though I can
> understand the reason, it's not apparent. Generally reusing slot names
> across multiple tests (in this file) is a source of confusion. But at
> least for the test you are adding, you could use a different slot name
> to avoid confusion.

I've added a comment there that dropping both the slots is required
for the next test. Also I cannot change the name of the slot as the
next tests need the same slot synced.

> +# Create some DDL on the primary so that the slot lags behind the standby
> +$primary->safe_psql('postgres', "CREATE TABLE push_wal (a int);");
> +
> +# Attempt to synchronize slots using API. The API will continue retrying
> +# synchronization until the remote slot catches up.
> +# The API will not return until this happens, to be able to make
> +# further calls, call the API in a background process.
> +my $log_offset = -s $standby1->logfile;
> +
> +my $h = $standby1->background_psql('postgres', on_error_stop => 0);
> +
> +$h->query_until(qr//, "SELECT pg_sync_replication_slots();\n");
>
> If the standby does not receive the WAL corresponding to the DDL
> before this function is executed, the slot will get synchronized
> immediately. I think we have to make sure that the standby has
> received the DDL before executing this function.

Yes, I've added a line to make sure that the stanbdy has caught up.

>
> Also most of the code which uses query_until has pattern like this:
> $h->query_until(qr/start/, q{\echo start
> SQL command});
> But we expect an empty string here. Why this difference?
>

I've modified it as suggested.

> I think we need a similar test to test promotion while the function is
> waiting for the slot to become sync-ready.
>

Unfortunately, that will make this test too long if I add one more
wait loop for slot sync.

> SyncReplicationSlots() and the main loop in ReplSlotSyncWorkerMain()
> are similar with some functional differences. Some part of their code
> needs to be kept in sync in future. How do we achieve that? At least
> we need a comment saying so in each of those patches and keep those
> two codes in proximity.
>

I've added a comment in the header of ReplSlotSyncWorkerMain to suggest this.

Attaching patch v23 addressing these comments.

regards,
Ajin Cherian
Fujitsu Australia

Attachment Content-Type Size
v23-0001-Improve-initial-slot-synchronization-in-pg_sync_.patch application/octet-stream 27.4 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message zengman 2025-11-12 08:29:54 Re: Extended Statistics set/restore/clear functions.
Previous Message Fujii Masao 2025-11-12 08:23:13 Fix incorrect assignment of InvalidXLogRecPtr to a non-LSN variable.