From: | Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> |
---|---|
To: | Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: How can end users know the cause of LR slot sync delays? |
Date: | 2025-09-05 07:19:55 |
Message-ID: | CAE9k0P=j=zAF_ADJseQ=8gq+uY6TMrTEiQsG7CzTx70KOSvxDw@mail.gmail.com |
Lists: | pgsql-hackers |
Hi Shlok,
Good to hear that you’re also interested in working on this task.
On Thu, Sep 4, 2025 at 8:26 PM Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
> Hi Ashutosh,
>
> I am also interested in this thread. And was working on a patch for it.
>
> On Wed, 3 Sept 2025 at 17:52, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
> >
> > Hi Amit,
> >
> > On Thu, Aug 28, 2025 at 3:26 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >>
> >> On Thu, Aug 28, 2025 at 11:07 AM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
> >> >
> >> > We have seen cases where slot synchronization gets delayed, for
> >> > example when the slot is behind the failover standby or vice versa, and
> >> > the slot sync worker has to wait for one to catch up with the other.
> >> > During this waiting period, users querying pg_replication_slots can only
> >> > see whether the slot has been synchronized or not. If it has already
> >> > synchronized, that’s fine, but if synchronization is taking longer, users
> >> > would naturally want to understand the reason for the delay.
> >> >
> >> > Is there a way for end users to know the cause of slot synchronization
> >> > delays, so they can take appropriate actions to speed it up?
> >> >
> >> > I understand that server logs are emitted in such cases, but logs are
> >> > not something end users would want to check regularly. Moreover, since
> >> > logging is configuration-based, relevant messages may sometimes be
> >> > skipped or suppressed.
> >> >
> >>
> >> Currently, the way to see the reason for sync skip is LOGs but I think
> >> it is better to add a new column like sync_skip_reason in
> >> pg_replication_slots. This can show the reasons like
> >> standby_LSN_ahead_remote_LSN. I think ideally users can compare
> >> standby's slot LSN/XMIN with remote_slot being synced. Do you have any
> >> better ideas?
> >>
> >
> > I have similar thoughts, but for clarity, I’d like to outline some of
> > the key steps I plan to take:
> >
> > Step 1: Define an enum for all possible reasons a slot persistence was
> > skipped.
> >
> > /*
> >  * Reasons why a replication slot sync was skipped.
> >  */
> > typedef enum ReplicationSlotSyncSkipReason
> > {
> >     RS_SYNC_SKIP_NONE = 0,                 /* No skip */
> >
> >     /* Remote slot is behind local reserved LSN */
> >     RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0),
> >
> >     /* Local slot ahead of remote, risk of data loss */
> >     RS_SYNC_SKIP_DATA_LOSS = (1 << 1),
> >
> >     /* Standby could not build a consistent snapshot */
> >     RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2)
> > } ReplicationSlotSyncSkipReason;
> >
> > --
> >
> I think we should also add the case when "remote_slot->confirmed_lsn >
> latestFlushPtr" (WAL corresponding to the confirmed lsn on remote slot
> is still not flushed on the Standby). In this case as well we are
> skipping the slot sync.
>
Yes, we can include this case as well.
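To make sure we mean the same thing, here is a rough sketch of how the extra reason could sit next to the ones I proposed. The member name RS_SYNC_SKIP_WAL_NOT_FLUSHED and the helper check_remote_wal_flushed() are hypothetical names I made up for illustration, and XLogRecPtr is redefined locally only so that the snippet compiles on its own; the comparison itself is the one you quoted.

```c
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr so the sketch is self-contained. */
typedef uint64_t XLogRecPtr;

typedef enum ReplicationSlotSyncSkipReason
{
    RS_SYNC_SKIP_NONE = 0,                   /* No skip */
    RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0),   /* Remote slot is behind local reserved LSN */
    RS_SYNC_SKIP_DATA_LOSS = (1 << 1),       /* Local slot ahead of remote, risk of data loss */
    RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2),     /* Standby could not build a consistent snapshot */

    /* Hypothetical name: WAL up to the remote confirmed LSN is not yet flushed on the standby */
    RS_SYNC_SKIP_WAL_NOT_FLUSHED = (1 << 3)
} ReplicationSlotSyncSkipReason;

/*
 * Hypothetical helper: report the new reason when the remote slot's
 * confirmed LSN is past what the standby has flushed, otherwise no skip.
 */
static ReplicationSlotSyncSkipReason
check_remote_wal_flushed(XLogRecPtr remote_confirmed_lsn,
                         XLogRecPtr latest_flush_ptr)
{
    if (remote_confirmed_lsn > latest_flush_ptr)
        return RS_SYNC_SKIP_WAL_NOT_FLUSHED;

    return RS_SYNC_SKIP_NONE;
}
```

In the real code the check would of course live wherever the sync is currently skipped for this condition, not in a separate helper.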
>
> > Step 2: Introduce new column to pg_replication_slots to store the skip
> > reason
> >
> > /* Inside pg_replication_slots table */
> > ReplicationSlotSyncSkipReason slot_sync_skip_reason;
> >
> > --
> >
> As per the discussion [1], I think it is more of stat related data and
> we should add it in the pg_stat_replication_slots view. Also we can
> add columns for 'slot sync skip count' and 'last slot sync skip'.
> Thoughts?
>
It’s not a bad choice, but what makes it a bit confusing for me is that
some of the slot sync information is stored in pg_replication_slots, while
some is in pg_stat_replication_slots.
Is there a possibility that when an end user queries pg_replication_slots,
it shows a particular slot as synced, but querying
pg_stat_replication_slots instead reveals a sync skip reason, or the other
way around?
Moreover, these views are primary data sources for end users, and the
information is critical for their operations. Splitting related information
across multiple views could increase the complexity of their queries.
>
> > Step 3: Function to convert enum to human-readable string that can be
> > stored in pg_replication_slots.
> >
> > /*
> >  * Convert ReplicationSlotSyncSkipReason bitmask to human-readable string.
> >  *
> >  * Returns a palloc'd string; caller is responsible for freeing it.
> >  */
> > static char *
> > replication_slot_sync_skip_reason_str(ReplicationSlotSyncSkipReason reason)
> > {
> >     StringInfoData buf;
> >
> >     initStringInfo(&buf);
> >
> >     if (reason == RS_SYNC_SKIP_NONE)
> >     {
> >         appendStringInfoString(&buf, "none");
> >         return buf.data;
> >     }
> >
> >     if (reason & RS_SYNC_SKIP_REMOTE_BEHIND)
> >         appendStringInfoString(&buf, "remote_behind|");
> >     if (reason & RS_SYNC_SKIP_DATA_LOSS)
> >         appendStringInfoString(&buf, "data_loss|");
> >     if (reason & RS_SYNC_SKIP_NO_SNAPSHOT)
> >         appendStringInfoString(&buf, "no_snapshot|");
> >
> >     /* Remove trailing '|' */
> >     if (buf.len > 0 && buf.data[buf.len - 1] == '|')
> >         buf.data[buf.len - 1] = '\0';
> >
> >     return buf.data;
> > }
> >
> > --
> >
> Why are we showing the cause of the slot sync delay as an aggregate of
> all causes occurring? I thought we should show the reason for the last
> slot sync delay?
>
Yes, we should show only the reason for the last sync skip; no
aggregation is needed here.
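With a single reason per slot, the bit flags and the palloc'd buffer from my Step 3 are no longer needed. A minimal sketch of what the conversion could shrink to, assuming plain sequential enum values; the fourth member and its "wal_not_flushed" label are hypothetical names for the case you raised above, and none of this is taken from your patch:

```c
/*
 * Sketch only: one reason at a time, so plain sequential values are
 * enough and no bitmask is required.
 */
typedef enum ReplicationSlotSyncSkipReason
{
    RS_SYNC_SKIP_NONE = 0,
    RS_SYNC_SKIP_REMOTE_BEHIND,
    RS_SYNC_SKIP_DATA_LOSS,
    RS_SYNC_SKIP_NO_SNAPSHOT,
    RS_SYNC_SKIP_WAL_NOT_FLUSHED    /* hypothetical name */
} ReplicationSlotSyncSkipReason;

/*
 * Map the last skip reason to a constant string. Nothing is allocated,
 * so the caller has nothing to free.
 */
static const char *
replication_slot_sync_skip_reason_name(ReplicationSlotSyncSkipReason reason)
{
    switch (reason)
    {
        case RS_SYNC_SKIP_NONE:
            return "none";
        case RS_SYNC_SKIP_REMOTE_BEHIND:
            return "remote_behind";
        case RS_SYNC_SKIP_DATA_LOSS:
            return "data_loss";
        case RS_SYNC_SKIP_NO_SNAPSHOT:
            return "no_snapshot";
        case RS_SYNC_SKIP_WAL_NOT_FLUSHED:
            return "wal_not_flushed";   /* hypothetical label */
    }

    /* Not reached for valid enum values. */
    return "unknown";
}
```

Whichever view ends up exposing the value, it would then only need to turn this constant string into a text datum.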
>
> > Step 4: Capture slot_sync_skip_reason whenever the relevant LOG messages
> > are generated, primarily inside update_local_synced_slot or
> > update_and_persist_local_synced_slot. This value can later be
> > persisted in the pg_replication_slots catalog.
> >
> > --
> >
> > Please let me know if you have any objections. I’ll share the WIP patch
> > in a few days.
> >
> > --
> I have attached a patch which I have worked on.
>
Thanks. I have already gone through it, but before I comment on the
patch itself, I think we should finalize the approach first.
--
With Regards,
Ashutosh Sharma.