| From: | Ajin Cherian <itsajin(at)gmail(dot)com> |
|---|---|
| To: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
| Cc: | shveta malik <shveta(dot)malik(at)gmail(dot)com>, Japin Li <japinli(at)hotmail(dot)com>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Improve pg_sync_replication_slots() to wait for primary to advance |
| Date: | 2025-11-12 08:24:18 |
| Message-ID: | CAFPTHDYoxLaz3k6EPd9kf-sP3piRNR1v4ovxwmZ1r6copO=MJw@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Mon, Nov 10, 2025 at 8:31 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> On Fri, Nov 7, 2025 at 10:36 AM Japin Li <japinli(at)hotmail(dot)com> wrote:
> >
> > >
> > > Attaching patch v22 addressing the above comments.
> >
> > @@ -62,8 +62,8 @@ LOGICAL_APPLY_MAIN "Waiting in main loop of logical replication apply process."
> > LOGICAL_LAUNCHER_MAIN "Waiting in main loop of logical replication launcher process."
> > LOGICAL_PARALLEL_APPLY_MAIN "Waiting in main loop of logical replication parallel apply process."
> > RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive, during streaming recovery."
> > -REPLICATION_SLOTSYNC_MAIN "Waiting in main loop of slot sync worker."
> > REPLICATION_SLOTSYNC_SHUTDOWN "Waiting for slot sync worker to shut down."
> > +REPLICATION_SLOTSYNC_MAIN "Waiting in main loop of slot synchronization."
> > SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
> > WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
> > WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
> >
> > I've noticed that all events are sorted alphabetically. I think we should keep
> > the order of REPLICATION_SLOTSYNC_MAIN unchanged.
> >
>
> +1.
>
Yes, changed.
> Few trivial comments:
>
> 1)
> Since we have always used the term 'SQL function' rather than API in
> existing code, shall we change all references of API to 'SQL function'
> in current patch:
>
> + * If the pg_sync_replication API is used to sync the slots, and if the slots
> "If the SQL function pg_sync_replication_slots() is used.."
>
> + * the reasons mentioned above, then the API also waits and retries until the
> API --> SQL function
>
> + * persist. It is utilized by the pg_sync_replication_slots() API.
> pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots()
>
> + * the API can retry.
> API --> SQL function
>
> + /* Set this, so that API can retry */
> API --> SQL function
>
> + * persist. It is utilized by the pg_sync_replication_slots() API.
> pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots()
>
> + * slot_persistence_pending - boolean used by pg_sync_replication_slots
> + * API to track if any slots could not be
> pg_sync_replication_slots API --> SQL function pg_sync_replication_slots()
>
> + * Interrupt handler for pg_sync_replication_slots() API.
> pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots()
>
>
> 2)
> ProcessSlotSyncAPIInterrupts
> slotsync_api_reread_config
> -- These also have API in it, but I do not have any better name
> suggestions here, we can retain the current ones and see what others
> say.
>
Changed.
> 3)
> /*
> * Re-read the config file.
> *
> * Exit if any of the slot sync GUCs have changed. The postmaster will
> * restart it.
> */
> static void
> slotsync_reread_config(void)
>
> Shall we change this existing comment to: Re-read the config file for
> slot sync worker.
>
> 4)
>
> +/*
> + * Re-read the config file and check for critical parameter changes.
> + *
> + */
> +static void
> +slotsync_api_reread_config(void)
>
> Shall we change comment to:
> /*
> * Re-read the config file for SQL function pg_sync_replication_slots()
> *
> * Emit error if any of the slot sync GUCs have changed.
> */
>
Changed.
On Mon, Nov 10, 2025 at 9:44 PM Ashutosh Bapat
<ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
>
> On Mon, Nov 10, 2025 at 3:01 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> >
> >
> > 2)
> > ProcessSlotSyncAPIInterrupts
> > slotsync_api_reread_config
> > -- These also have API in it, but I do not have any better name
> > suggestions here, we can retain the current ones and see what others
> > say.
>
> ProcessSlotSyncInterrupts() handles shutdown waiting,
> ProcessSlotSyncAPIInterrupts doesn't. Why is this difference? It will
> be good to explain why we need two different functions for worker and
> SQL function and also explain the difference between them.
>
I've updated the function header to explain this. The slot sync worker
is a specific background worker while the API runs in the regular
backend, so special handling is not needed.
> $primary->psql('postgres', "SELECT pg_drop_replication_slot('lsub2_slot');");
> +$primary->psql('postgres', "SELECT pg_drop_replication_slot('lsub1_slot');");
>
> I think, the intention behind dropping the slot is to be able to
> create it again for the next test. But there is no comment here
> explaining that. There is also no comment explaining why we are
> dropping both slots here; when the test only needs dropping one.
> That's going to create confusion. One might think that all the slots
> need to be dropped at this stage, and drop and create any future slots
> that are used by prior code, for example. At the end of this test, we
> recreate the slot using pg_create_logical_replication_slot(), which is
> different method of creating slot than this test does. Though I can
> understand the reason, it's not apparent. Generally reusing slot names
> across multiple tests (in this file) is a source of confusion. But at
> least for the test you are adding, you could use a different slot name
> to avoid confusion.
I've added a comment there that dropping both the slots is required
for the next test. Also I cannot change the name of the slot as the
next tests need the same slot synced.
> +# Create some DDL on the primary so that the slot lags behind the standby
> +$primary->safe_psql('postgres', "CREATE TABLE push_wal (a int);");
> +
> +# Attempt to synchronize slots using API. The API will continue retrying
> +# synchronization until the remote slot catches up.
> +# The API will not return until this happens, to be able to make
> +# further calls, call the API in a background process.
> +my $log_offset = -s $standby1->logfile;
> +
> +my $h = $standby1->background_psql('postgres', on_error_stop => 0);
> +
> +$h->query_until(qr//, "SELECT pg_sync_replication_slots();\n");
>
> If the standby does not receive the WAL corresponding to the DDL
> before this function is executed, the slot will get synchronized
> immediately. I think we have to make sure that the standby has
> received the DDL before executing this function.
Yes, I've added a line to make sure that the stanbdy has caught up.
>
> Also most of the code which uses query_until has pattern like this:
> $h->query_until(qr/start/, q{\echo start
> SQL command});
> But we expect an empty string here. Why this difference?
>
I've modified it as suggested.
> I think we need a similar test to test promotion while the function is
> waiting for the slot to become sync-ready.
>
Unfortunately, that will make this test too long if I add one more
wait loop for slot sync.
> SyncReplicationSlots() and the main loop in ReplSlotSyncWorkerMain()
> are similar with some functional differences. Some part of their code
> needs to be kept in sync in future. How do we achieve that? At least
> we need a comment saying so in each of those patches and keep those
> two codes in proximity.
>
I've added a comment in the header of ReplSlotSyncWorkerMain to suggest this.
Attaching patch v23 addressing these comments.
regards,
Ajin Cherian
Fujitsu Australia
| Attachment | Content-Type | Size |
|---|---|---|
| v23-0001-Improve-initial-slot-synchronization-in-pg_sync_.patch | application/octet-stream | 27.4 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | zengman | 2025-11-12 08:29:54 | Re: Extended Statistics set/restore/clear functions. |
| Previous Message | Fujii Masao | 2025-11-12 08:23:13 | Fix incorrect assignment of InvalidXLogRecPtr to a non-LSN variable. |