Re: How can end users know the cause of LR slot sync delays?

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: How can end users know the cause of LR slot sync delays?
Date: 2025-11-26 04:54:59
Message-ID: CAJpy0uCJxXiKTwTSpO==uDARPGsNVYgWiXJht9j90mq0bBvTtA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Nov 26, 2025 at 9:30 AM Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
>
> On Wed, 26 Nov 2025 at 09:00, Hayato Kuroda (Fujitsu)
> <kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
> >
> > Dear Hou,
> >
> > > I think we did not sync slots to standby2, so I removed the checks for that.
> > >
> > > I also adjusted the test in a way that it cleans up existing slots before starting
> > > new tests.
> >
> > Thanks for updating the patch. I confirmed on my env that your patch could be
> > applied cleanly and tests were passed. Pgperltidy say nothing for your patch.
> > Also, I preferred the current style.
> >
> > I think it is worth checking on BF to see how they say.
> >
>
> Thanks Amit for pushing the 0001 patch.
> Thanks Hou-san and Kuroda-san on fixing the test.
>
> I have rebased the 0002 patch on the current HEAD.
>

Thanks. Please find a few comments:

1)

+ The reason for the last slot synchronization skip. This field
is set only
+ for logical slots that are being synchronized from a primary
server (that
+ is, those whose <structfield>synced</structfield> field is
+ <literal>true</literal>). The value of this column has no meaning on the
+ primary server; it defaults to <literal>none</literal> for all
slots, but
+ may (if leftover from a promoted standby) also have a value other than
+ <literal>none</literal>. Possible values are:

We can make this similar to existing fields (slotsync_skip_count and
slotsync_skip_at):

Slot synchronization occurs only on standby servers and thus this column has
no meaning on the primary server.

2)
Doc on possible values of slotsync_skip_reason can be improved. Example

+ <literal>remote_behind</literal> means that the last slot
+ synchronization was skipped because the slot is ahead of the
+ corresponding failover slot on the primary.

We can get rid of 'last slot synchronization was skipped' from all the
reasons. (See 'invalidation_reason' possible values for reference, it
does not mention 'was invalidated' in any).

3)
+ <literal>wal_not_flushed</literal> means that the last slot
+ synchronization was skipped because the standby had not flushed the
+ WAL corresponding to the confirmed flush position on the remote slot.

I am not sure if we need to mention 'confirmed flush position'. Shall we say:

'.....because the standby had not flushed the WAL corresponding to the
position reserved on the remote slot'.
Thoughts?

4)
+ <literal>none</literal> means that the last slot synchronization
+ completed successfully.

Do we even need to mention 'none' in doc? 'invalidation_reason' does
not mention it.

5)
postgres=# select slot_name, invalidation_reason, slotsync_skip_reason
from pg_replication_slots;
slot_name | invalidation_reason | slotsync_skip_reason
----------------+---------------------+----------------------
failover_slot2 | | none

Shall we keep 'slotsync_skip_reason' as NULL instead of 'none' similar
to invalidation_reason. Thoughts?

6)
If we plan to keep slotsync_skip_reason as NULL instead of 'none' for
non-skipped cases (above comment), then below code can be optimised as
we do not need to update 'none' as stats.
'skip_reason' and last update() call can then be removed and we can
simply call update_slotsync_skip_stats() instead of 'skip_reason =
SS_SKIP_NO_CONSISTENT_SNAPSHOT'.

update_local_synced_slot():

+ SlotSyncSkipReason skip_reason = SS_SKIP_NONE;
+ update_slotsync_skip_stats(SS_SKIP_REMOTE_BEHIND);
+ skip_reason = SS_SKIP_NO_CONSISTENT_SNAPSHOT;
+ /* Update slot sync skip stats */
+ update_slotsync_skip_stats(skip_reason);

7)

+static char *
+GetSlotSyncSkipReason(SlotSyncSkipReason reason)

We are passing 'SlotSyncSkipReason' and function name says
'GetSlotSyncSkipReason', looks confusing.
Shall we rename function name to GetSlotSyncSkipString or
GetSlotSyncSkipReasonName (similar to GetSlotInvalidationCauseName)

8)

+ * A slotsync skip typically occurs only for temporary slots. For
+ * persistent slots it is extremely rare (e.g., cases like
+ * SS_SKIP_WAL_NOT_FLUSHED or SS_SKIP_REMOTE_BEHIND). Also, temporary
+ * slots are dropped after server restart, so there is no value in
+ * persisting the slotsync_skip_reason.
+ */
+ SlotSyncSkipReason slotsync_skip_reason;

I feel, 'Also-->Since' will make more sense here.

9)
In doc for slotsync_skip_count and slotsync_skip_at

+ Slot
+ synchronization occur only on standby servers and thus this column has
+ no meaning on the primary server.

occur --> occurs

thanks
Shveta

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2025-11-26 05:01:04 Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
Previous Message Michael Paquier 2025-11-26 04:54:03 Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)