Re: How can end users know the cause of LR slot sync delays?

From: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
To: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: How can end users know the cause of LR slot sync delays?
Date: 2025-09-22 10:11:33
Message-ID: CANhcyEWoXbzRL6xTcahBtuV1vzRH8NiDu0fNjby1XDJ1jW4Ckw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 18 Sept 2025 at 13:17, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
>
> Hi Amit,
>
> On Thu, Sep 18, 2025 at 11:31 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Wed, Sep 17, 2025 at 8:19 PM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
> > >
> > > On Wed, Sep 17, 2025 at 5:14 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > On Wed, Sep 17, 2025 at 4:24 PM Hayato Kuroda (Fujitsu)
> > > > <kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
> > > > >
> > > > > Dear Shlok,
> > > > >
> > > > > Thanks for creating the patch. Personally I prefer approach2; approach1 cannot
> > > > > indicate the current status of synchronization, it just shows the history.
> > > > > I feel approach2 has more information than approach1.
> > > > >
> > > >
> > > > I also think so but Ashutosh thought that it would be hacky. Ashutosh,
> > > > did you have an opinion on this matter after seeing the patches?
> > > >
> > >
> > > Yes, I’ve looked into both the patches. Approach 1 seems quite
> > > straightforward. In approach 2, we need to pass some additional
> > > arguments to update_local_sync_slot and
> > > update_and_persist_local_synced_slot, which makes it feel a little
> > > less clean compared to approach 1, where we simply add a new function
> > > and call it directly.
> > >
> >
> > This is because the approach-1 doesn't show the latest value of
> > sync_status. I mean in the latest cycle if the sync is successful, it
> > won't update the stats which I am not sure is correct because users
> > may want to know the recent status of sync cycle. Otherwise, the patch
> > should be almost the same.
>
> This should be manageable, no? If we add an additional call to the
> stats report function immediately after ReplicationSlotPersist(),
> wouldn’t that address the issue? Please correct me if I’m overlooking
> something.
>
> @@ -600,6 +600,8 @@ update_and_persist_local_synced_slot(RemoteSlot
> *remote_slot, Oid remote_dbid)
>
> ReplicationSlotPersist();
>
> + pgstat_report_replslot_sync_skip(slot, SLOT_SYNC_SKIP_NONE);
> +
> ereport(LOG,
> errmsg("newly created replication slot \"%s\"
> is sync-ready now",
> remote_slot->name));
>
Currently, in this code change, the skip_reason is updated to 'none'
only when the slot state changes from temporary to persistent.
However, this logic does not handle cases where the slot is already a
persistent sync slot.
I believe we should also sync skip can happen for persistent slots.
That means, on a successful slot sync, we should update the
skip_reason to 'none' even for slots that are already persistent.

> In addition to this, should anyone really need to query the skip
> reason if pg_replication_slots already shows that the slot is synced
> and not temporary? Ideally, users should check the slot status in
> pg_replication_slots, and if it indicates the slot is persisted, there
> seems little value in enquiring pg_stat_replication_slots for the skip
> reason. That said, it’s important to ensure the information in both
> views remains consistent.
>
I have a doubt. Why don't we want to report the sync skip reason once
the slots are persisted?
for the case:

latestFlushPtr = GetStandbyFlushRecPtr(NULL);
if (remote_slot->confirmed_lsn > latestFlushPtr)
{
/*
* Can get here only if GUC 'synchronized_standby_slots' on the
* primary server was not configured correctly.
*/
ereport(AmLogicalSlotSyncWorkerProcess() ? LOG : ERROR,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("skipping slot synchronization because the
received slot sync"
" LSN %X/%08X for slot \"%s\" is ahead of the
standby position %X/%08X",
LSN_FORMAT_ARGS(remote_slot->confirmed_lsn),
remote_slot->name,
LSN_FORMAT_ARGS(latestFlushPtr)));

return false;
}

Slot sync skip can happen even for persistent slots. So why should we
avoid displaying the skip reason in such cases?
I checked that if the synchronized_standby_slots GUC is not set
properly, we can hit this condition even for persistent slots.
I think we should still display the skip reason if the user does not
configure this GUC as expected.

> > I think we can even try to write a patch
> > for approach-2 without an additional out parameter in some of the
> > functions.
>
> We can aim for this, if possible.

Thanks,
Shlok Kyal

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ajin Cherian 2025-09-22 10:51:14 Re: Improve pg_sync_replication_slots() to wait for primary to advance
Previous Message Pavel Stehule 2025-09-22 10:03:26 Re: [PATCH] Introduce unified support for composite GUC options