| From: | shveta malik <shveta(dot)malik(at)gmail(dot)com> |
|---|---|
| To: | Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> |
| Cc: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com> |
| Subject: | Re: How can end users know the cause of LR slot sync delays? |
| Date: | 2025-11-26 04:54:59 |
| Message-ID: | CAJpy0uCJxXiKTwTSpO==uDARPGsNVYgWiXJht9j90mq0bBvTtA@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Wed, Nov 26, 2025 at 9:30 AM Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
>
> On Wed, 26 Nov 2025 at 09:00, Hayato Kuroda (Fujitsu)
> <kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
> >
> > Dear Hou,
> >
> > > I think we did not sync slots to standby2, so I removed the checks for that.
> > >
> > > I also adjusted the test in a way that it cleans up existing slots before starting
> > > new tests.
> >
> > Thanks for updating the patch. I confirmed on my env that your patch could be
> > applied cleanly and tests were passed. Pgperltidy say nothing for your patch.
> > Also, I preferred the current style.
> >
> > I think it is worth checking on BF to see how they say.
> >
>
> Thanks Amit for pushing the 0001 patch.
> Thanks Hou-san and Kuroda-san on fixing the test.
>
> I have rebased the 0002 patch on the current HEAD.
>
Thanks. Please find a few comments:
1)
+ The reason for the last slot synchronization skip. This field
is set only
+ for logical slots that are being synchronized from a primary
server (that
+ is, those whose <structfield>synced</structfield> field is
+ <literal>true</literal>). The value of this column has no meaning on the
+ primary server; it defaults to <literal>none</literal> for all
slots, but
+ may (if leftover from a promoted standby) also have a value other than
+ <literal>none</literal>. Possible values are:
We can make this similar to existing fields (slotsync_skip_count and
slotsync_skip_at):
Slot synchronization occurs only on standby servers and thus this column has
no meaning on the primary server.
2)
Doc on possible values of slotsync_skip_reason can be improved. Example
+ <literal>remote_behind</literal> means that the last slot
+ synchronization was skipped because the slot is ahead of the
+ corresponding failover slot on the primary.
We can get rid of 'last slot synchronization was skipped' from all the
reasons. (See 'invalidation_reason' possible values for reference, it
does not mention 'was invalidated' in any).
3)
+ <literal>wal_not_flushed</literal> means that the last slot
+ synchronization was skipped because the standby had not flushed the
+ WAL corresponding to the confirmed flush position on the remote slot.
I am not sure if we need to mention 'confirmed flush position'. Shall we say:
'.....because the standby had not flushed the WAL corresponding to the
position reserved on the remote slot'.
Thoughts?
4)
+ <literal>none</literal> means that the last slot synchronization
+ completed successfully.
Do we even need to mention 'none' in doc? 'invalidation_reason' does
not mention it.
5)
postgres=# select slot_name, invalidation_reason, slotsync_skip_reason
from pg_replication_slots;
slot_name | invalidation_reason | slotsync_skip_reason
----------------+---------------------+----------------------
failover_slot2 | | none
Shall we keep 'slotsync_skip_reason' as NULL instead of 'none' similar
to invalidation_reason. Thoughts?
6)
If we plan to keep slotsync_skip_reason as NULL instead of 'none' for
non-skipped cases (above comment), then below code can be optimised as
we do not need to update 'none' as stats.
'skip_reason' and last update() call can then be removed and we can
simply call update_slotsync_skip_stats() instead of 'skip_reason =
SS_SKIP_NO_CONSISTENT_SNAPSHOT'.
update_local_synced_slot():
+ SlotSyncSkipReason skip_reason = SS_SKIP_NONE;
+ update_slotsync_skip_stats(SS_SKIP_REMOTE_BEHIND);
+ skip_reason = SS_SKIP_NO_CONSISTENT_SNAPSHOT;
+ /* Update slot sync skip stats */
+ update_slotsync_skip_stats(skip_reason);
7)
+static char *
+GetSlotSyncSkipReason(SlotSyncSkipReason reason)
We are passing 'SlotSyncSkipReason' and function name says
'GetSlotSyncSkipReason', looks confusing.
Shall we rename function name to GetSlotSyncSkipString or
GetSlotSyncSkipReasonName (similar to GetSlotInvalidationCauseName)
8)
+ * A slotsync skip typically occurs only for temporary slots. For
+ * persistent slots it is extremely rare (e.g., cases like
+ * SS_SKIP_WAL_NOT_FLUSHED or SS_SKIP_REMOTE_BEHIND). Also, temporary
+ * slots are dropped after server restart, so there is no value in
+ * persisting the slotsync_skip_reason.
+ */
+ SlotSyncSkipReason slotsync_skip_reason;
I feel, 'Also-->Since' will make more sense here.
9)
In doc for slotsync_skip_count and slotsync_skip_at
+ Slot
+ synchronization occur only on standby servers and thus this column has
+ no meaning on the primary server.
occur --> occurs
thanks
Shveta
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Amit Kapila | 2025-11-26 05:01:04 | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart |
| Previous Message | Michael Paquier | 2025-11-26 04:54:03 | Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem) |