Re: How can end users know the cause of LR slot sync delays?

From: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: How can end users know the cause of LR slot sync delays?
Date: 2025-09-03 12:22:21
Message-ID: CAE9k0Pmh86ctxaOQ0QZkt0gmg+pJbu34w-maG=NoJXfbR80hoA@mail.gmail.com
Lists: pgsql-hackers

Hi Amit,

On Thu, Aug 28, 2025 at 3:26 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Thu, Aug 28, 2025 at 11:07 AM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
> >
> > We have seen cases where slot synchronization gets delayed, for example
> > when the slot is behind the failover standby or vice versa, and the slot
> > sync worker has to wait for one to catch up with the other. During this
> > waiting period, users querying pg_replication_slots can only see whether
> > the slot has been synchronized or not. If it has already synchronized,
> > that’s fine, but if synchronization is taking longer, users would naturally
> > want to understand the reason for the delay.
> >
> > Is there a way for end users to know the cause of slot synchronization
> > delays, so they can take appropriate actions to speed it up?
> >
> > I understand that server logs are emitted in such cases, but logs are
> > not something end users would want to check regularly. Moreover, since
> > logging is configuration-based, relevant messages may sometimes be skipped
> > or suppressed.
> >
>
> Currently, the way to see the reason for sync skip is LOGs but I think
> it is better to add a new column like sync_skip_reason in
> pg_replication_slots. This can show the reasons like
> standby_LSN_ahead_remote_LSN. I think ideally users can compare
> standby's slot LSN/XMIN with remote_slot being synced. Do you have any
> better ideas?
>
>
I have similar thoughts, but for clarity, I’d like to outline some of the
key steps I plan to take:

Step 1: Define an enum for all the possible reasons why a slot sync was
skipped.

/*
 * Reasons why a replication slot sync was skipped.
 */
typedef enum ReplicationSlotSyncSkipReason
{
    RS_SYNC_SKIP_NONE = 0,                  /* No skip */

    RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0),  /* Remote slot is behind local
                                             * reserved LSN */

    RS_SYNC_SKIP_DATA_LOSS = (1 << 1),      /* Local slot ahead of remote,
                                             * risk of data loss */

    RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2)     /* Standby could not build a
                                             * consistent snapshot */
} ReplicationSlotSyncSkipReason;

--

Step 2: Introduce a new column in pg_replication_slots to store the skip
reason.

/* Exposed through the pg_replication_slots view */
ReplicationSlotSyncSkipReason slot_sync_skip_reason;

--

Step 3: Add a function that converts the enum to a human-readable string
that can be stored in pg_replication_slots.

/*
 * Convert ReplicationSlotSyncSkipReason bitmask to human-readable string.
 *
 * Returns a palloc'd string; caller is responsible for freeing it.
 */
static char *
replication_slot_sync_skip_reason_str(ReplicationSlotSyncSkipReason reason)
{
    StringInfoData buf;

    initStringInfo(&buf);

    if (reason == RS_SYNC_SKIP_NONE)
    {
        appendStringInfoString(&buf, "none");
        return buf.data;
    }

    if (reason & RS_SYNC_SKIP_REMOTE_BEHIND)
        appendStringInfoString(&buf, "remote_behind|");
    if (reason & RS_SYNC_SKIP_DATA_LOSS)
        appendStringInfoString(&buf, "data_loss|");
    if (reason & RS_SYNC_SKIP_NO_SNAPSHOT)
        appendStringInfoString(&buf, "no_snapshot|");

    /* Remove trailing '|' */
    if (buf.len > 0 && buf.data[buf.len - 1] == '|')
        buf.data[buf.len - 1] = '\0';

    return buf.data;
}
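
For anyone who wants to try the formatting logic outside the backend, here
is a rough standalone sketch of the same bitmask-to-string idea. It is not
part of any patch: skip_reason_to_text and skip_reason_labels are
hypothetical names, the caller supplies the buffer instead of using
StringInfo/palloc, and the separator is written before each label rather
than trimmed afterwards.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same flag values as the enum proposed in Step 1. */
typedef enum ReplicationSlotSyncSkipReason
{
    RS_SYNC_SKIP_NONE = 0,
    RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0),
    RS_SYNC_SKIP_DATA_LOSS = (1 << 1),
    RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2)
} ReplicationSlotSyncSkipReason;

/* Hypothetical lookup table (illustration only): one entry per flag bit. */
static const struct
{
    unsigned int bit;
    const char *label;
} skip_reason_labels[] = {
    {RS_SYNC_SKIP_REMOTE_BEHIND, "remote_behind"},
    {RS_SYNC_SKIP_DATA_LOSS, "data_loss"},
    {RS_SYNC_SKIP_NO_SNAPSHOT, "no_snapshot"}
};

/*
 * Hypothetical standalone variant of the Step 3 function.
 *
 * Writes a '|'-separated list of labels for the bits set in 'reason' into
 * the caller-supplied buffer and returns that buffer.
 */
static const char *
skip_reason_to_text(unsigned int reason, char *dst, size_t dstlen)
{
    size_t used = 0;
    size_t i;

    assert(dst != NULL && dstlen > 0);
    dst[0] = '\0';

    if (reason == RS_SYNC_SKIP_NONE)
    {
        snprintf(dst, dstlen, "none");
        return dst;
    }

    for (i = 0; i < sizeof(skip_reason_labels) / sizeof(skip_reason_labels[0]); i++)
    {
        int n;

        if ((reason & skip_reason_labels[i].bit) == 0)
            continue;

        /* Emit the separator before every label except the first. */
        n = snprintf(dst + used, dstlen - used, "%s%s",
                     used > 0 ? "|" : "", skip_reason_labels[i].label);
        if (n < 0 || (size_t) n >= dstlen - used)
            break;              /* output was truncated; stop appending */
        used += (size_t) n;
    }

    return dst;
}
```

With a 64-byte buffer, passing RS_SYNC_SKIP_REMOTE_BEHIND |
RS_SYNC_SKIP_NO_SNAPSHOT fills it with "remote_behind|no_snapshot". Keeping
the labels in a table means a new reason only needs one new row.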

--

Step 4: Capture slot_sync_skip_reason whenever the relevant LOG messages
are generated, primarily inside update_local_synced_slot or
update_and_persist_local_synced_slot. This value can later be
persisted in the pg_replication_slots catalog.

--

Please let me know if you have any objections. I’ll share the WIP patch in
a few days.

--
With Regards,
Ashutosh Sharma.
