| From: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
|---|---|
| To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com> |
| Cc: | Ajin Cherian <itsajin(at)gmail(dot)com>, Yilin Zhang <jiezhilove(at)126(dot)com>, Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Japin Li <japinli(at)hotmail(dot)com>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | RE: Improve pg_sync_replication_slots() to wait for primary to advance |
| Date: | 2026-03-04 06:56:40 |
| Message-ID: | TY4PR01MB16907630F7DE9263795978897947CA@TY4PR01MB16907.jpnprd01.prod.outlook.com |
| Lists: | pgsql-hackers |
On Tuesday, February 17, 2026 12:16 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Tue, Feb 17, 2026 at 9:13 AM shveta malik <shveta(dot)malik(at)gmail(dot)com>
> wrote:
> >
> > On Mon, Feb 16, 2026 at 4:35 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> > >
> > > On Fri, Feb 13, 2026 at 7:54 AM Zhijie Hou (Fujitsu)
> > > <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > > >
> > > > Thanks for pushing! Here are the remaining patches.
> > > >
> > >
> > > One thing that bothers me about the remaining patch is that it could
> > > lead to infinite retries in the worst case. For example, in the first
> > > try, slot-1 is not synced, say due to physical replication delays in
> > > flushing WAL up to the confirmed_flush_lsn of that slot, then in the
> > > next (re-)try, the same thing happens for slot-2, then in the next
> > > (re-)try, slot-3 appears to be invalidated on the standby but is valid
> > > on the primary, and so on. What do you think?
> >
> > Yes, that is a possibility we cannot rule out. This can also happen
> > during the first invocation of the API (even without the new changes)
> > when we attempt to create new slots, they may remain in a temporary
> > state indefinitely. However, that risk is limited to the initial sync,
> > until the slots are persisted, which is somewhat expected behavior.
> >
>
> Right.
>
> > With the current changes though, the possibility of an indefinite wait
> > exists during every run. So the question becomes: what would be more
> > desirable for users -- for the API to finish with the risk that a few
> > slots are not synced, or for the API to wait longer to ensure that all
> > slots are properly synced?
> >
> > I think that if the primary use case of this API is when a user plans
> > to run it before a scheduled failover, then it would be better for the
> > API to wait and ensure everything is properly synced.
> >
>
> I don't think we can guarantee that all slots are synced as per latest primary
> state in one invocation because some newly created slots can anyway be
> missed. So why take the risk of infinite waits in the API? I think it may be
> better to extend the usage of this API (probably with more parameters) based
> on more user feedback.
>

I agree that we could wait for more user feedback before extending the
retry logic to persisted slots.

> > > Independent of whether we consider the entire patch, the following
> > > bit in the patch is useful as we retry to sync the slots via API.
> > > @@ -218,7 +219,7 @@ update_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid)
> > > * Can get here only if GUC 'synchronized_standby_slots' on the
> > > * primary server was not configured correctly.
> > > */
> > > - ereport(AmLogicalSlotSyncWorkerProcess() ? LOG : ERROR,
> > > + ereport(LOG,
> > > errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > > errmsg("skipping slot synchronization because the received slot sync"
> > > " LSN %X/%08X for slot \"%s\" is ahead of the standby position %X/%08X",
> > >
> >
> > yes. I agree.
> >
>
> Let's wait for Hou-San's opinion on this one.

+1 for changing this.

Here is the patch set that converts the elevel to LOG, so that the function
retries cyclically until the standby catches up and the slot is successfully
persisted.

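To illustrate the intended behavior, here is a rough standalone model. It is
not the patch itself: SlotModel, try_sync_slot() and sync_until_persisted()
are hypothetical names, and the max_rounds bound exists only so the sketch
terminates. A slot whose LSN is ahead of the standby position is skipped with
a LOG-style message instead of aborting the whole call, and is retried on the
next round once the standby has advanced.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; not PostgreSQL's real types. */
typedef uint64_t Lsn;

typedef struct SlotModel
{
	const char *name;
	Lsn			remote_confirmed_flush; /* position reported by the primary */
	bool		persisted;				/* true once the slot has been synced */
} SlotModel;

/*
 * Try to sync one slot.  If the slot's LSN is ahead of what the standby has
 * flushed, emit a LOG-style message and return false rather than erroring
 * out, so the caller can retry later.
 */
static bool
try_sync_slot(SlotModel *slot, Lsn standby_flush)
{
	if (slot->remote_confirmed_flush > standby_flush)
	{
		fprintf(stderr, "LOG: skipping slot \"%s\": ahead of standby position\n",
				slot->name);
		return false;
	}

	slot->persisted = true;
	return true;
}

/*
 * Retry until every slot is persisted or max_rounds is exhausted.  The
 * standby position advances by 'step' after each round, modeling the standby
 * catching up.  Returns the number of rounds used, or -1 if
 * some slot was still not persisted after max_rounds (assumed bound).
 */
static int
sync_until_persisted(SlotModel *slots, int nslots, Lsn standby_flush,
					 Lsn step, int max_rounds)
{
	for (int round = 1; round <= max_rounds; round++)
	{
		bool		all_persisted = true;

		for (int i = 0; i < nslots; i++)
		{
			/* Already-persisted slots are not retried in this model. */
			if (!slots[i].persisted &&
				!try_sync_slot(&slots[i], standby_flush))
				all_persisted = false;
		}

		if (all_persisted)
			return round;

		standby_flush += step;	/* standby catches up a bit before retry */
	}

	return -1;
}
```

With the old ERROR level, the first skipped slot would have ended the call;
in this model the skip is only logged and the loop continues. The explicit
round limit also shows where a bound could go if infinite retries become a
concern.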
Best Regards,
Hou zj
| Attachment | Content-Type | Size |
|---|---|---|
| v7-0002-Add-a-taptest.patch | application/octet-stream | 3.2 KB |
| v7-0001-Extend-the-retry-logic-in-pg_sync_replication_slo.patch | application/octet-stream | 3.2 KB |