RE: How can end users know the cause of LR slot sync delays?

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Shlok Kyal' <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
Subject: RE: How can end users know the cause of LR slot sync delays?
Date: 2025-09-17 10:54:39
Message-ID: OSCPR01MB14966A618A8C61EC3DEE486A4F517A@OSCPR01MB14966.jpnprd01.prod.outlook.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear Shlok,

Thanks for creating the patch. Personally I prefer approach2; approach1 cannot
indicate the current status of synchronization, it just shows the history.
I feel approach2 has more information than approach1.

Below are my comments for your patch.

01.
```
+ Number of times the slot sync is skipped.
```

"slot sync" should be "slot synchronization". Same thing can be saied for other
attributes.

02.
```
+ Reason of the last slot sync skip.
```

Possible values must be clarified.

03.
```
+ s.slot_sync_skip_count,
+ s.last_slot_sync_skip,
+ s.slot_sync_skip_reason,
```

Last line has tab-blank but others have space-blank.

04.
```
+typedef enum SlotSyncSkipReason
+{
+ SLOT_SYNC_SKIP_NONE, /* No skip */
+ SLOT_SYNC_SKIP_STANDBY_BEHIND, /* Standby is behind the remote slot */
+ SLOT_SYNC_SKIP_REMOTE_BEHIND, /* Remote slot is behind the local slot */
+ SLOT_SYNC_SKIP_NO_CONSISTENT_SNAPSHOT /* Standby could not reach a
+ * consistent snapshot */
+} SlotSyncSkipReason
```

a.
Can we add comment atop the enum?

b.
SLOT_SYNC_SKIP_STANDBY_BEHIND is misleading; it indicates the standby server has
not received WALs required by slots, but enum does not clarify it.
How about SLOT_SYNC_SKIP_MISSING_WAL_RECORDS or something? Better naming are very
welcome.

c
s/reach/build/.

05.
```
@@ -646,11 +652,13 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
remote_slot->name,
LSN_FORMAT_ARGS(latestFlushPtr)));

- return false;
+ /* If the slot is not present on the local */
+ if (!(slot = SearchNamedReplicationSlot(remote_slot->name, true)))
+ return false;
}
```

Looks like if users try to sync slots via SQL interfaces, the statistics cannot
be updated.

06. update_slot_sync_skip_stats
```
+ /* Update the slot sync reason */
+ SpinLockAcquire(&slot->mutex);
+ slot->slot_sync_skip_reason = skip_reason;
+ SpinLockRelease(&slot->mutex);
```

I feel no need to update the reason if the slot->slot_sync_skip_reason
and skip_reason are the same.

07. synchronize_one_slot
```
+ /*
+ * If standby is behind remote slot and the synced slot is present on
+ * local.
+ */
+ if (remote_slot->confirmed_lsn > latestFlushPtr)
+ {
+ if (synced)
+ update_slot_sync_skip_stats(slot, skip_reason);
+ return false;
+ }
```

This condition exist in the same function; can we combine?

08. GetSlotSyncSkipReason()

Do we have to do pstrdup() here? I found a similar function get_snapbuild_state_desc(),
and it does not use.

09.

Can you consider a test for added codes?

Best regards,
Hayato Kuroda
FUJITSU LIMITED

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2025-09-17 10:58:33 Re: PG 18 release notes draft committed
Previous Message Etsuro Fujita 2025-09-17 10:53:05 Re: someone else to do the list of acknowledgments