| From: | Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> |
|---|---|
| To: | shveta malik <shveta(dot)malik(at)gmail(dot)com> |
| Cc: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> |
| Subject: | Re: How can end users know the cause of LR slot sync delays? |
| Date: | 2025-11-12 10:55:32 |
| Message-ID: | CANhcyEXEyUt8TycOqkgdWT+NeJASM=1acU1dK64UNsJeKMQFQA@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Wed, 5 Nov 2025 at 11:49, shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> On Tue, Nov 4, 2025 at 3:17 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> > On Mon, Nov 3, 2025 at 3:14 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> > >
> > > >
> > > > I have attached the latest patch.
> > > >
> > >
> > > Thanks. I have started going through it.
> > >
> > > I’m not sure if this has already been discussed; I couldn’t find any
> > > mention of it in the thread. Why don’t we persist
> > > 'slot_sync_skip_reason' (it is outside of
> > > ReplicationSlotPersistentData)? If a slot wasn’t synced during the
> > > last cycle and the server restarts, it would be helpful to know the
> > > reason it wasn’t synced prior to the node restart.
> > >
Actually I did not think in this direction. I think it will be useful
to persist 'slot_sync_skip_reason'. I have made the change for the
same in the latest patch.
> >
> > Please find a few more comments:
> >
> > 1)
> > last_slot_sync_skip
> >
> > Will 'last_slotsync_skip_at' be a better name?
> >
> > Since we refer worker as slotsync in docs, I feel slotsync seems a
> > more natural choice than slot_sync. Also '_at' gives clarity that it
> > is about time rather than a boolean (which currently it seems like).
> >
> > Same goes for slot_sync_skip_count and slot_sync_skip_reason. Shall
> > these be slotsync_skip_count and slotsync_skip_reason.
> >
Fixed it.
> > 2)
> > +update_slot_sync_skip_stats(ReplicationSlot *slot, SlotSyncSkipReason
> > skip_reason,
> > + bool acquire_slot)
> >
> > It looks like there is only one caller that passes acquire_slot as
> > true, while all others pass false. Instead of keeping the acquire_slot
> > parameter, will it be better if we remove it and add an
> > Assert(MyReplication) to ensure the slot is already acquired? We can
> > add a comment stating that this function expects the slot to be
> > acquired by the caller. The one caller that currently passes
> > acquire_slot = true can acquire the slot explicitly before invoking
> > this function. Thoughts?
This idea looks good to me. I have updated the patch accordingly.
> >
> > 3)
> > In update_and_persist_local_synced_slot(), we get both
> > 'found_consistent_snapshot' and 'remote_slot_precedes' from
> > update_local_synced_slot(). But skipsync-reason for
> > 'remote_slot_precedes' is updated inside update_local_synced_slot()
> > while skipsync-reason for '!found_consistent_snapshot' is updated in
> > caller update_and_persist_local_synced_slot. Is there a reason for
> > that?
> >
update_and_persist_local_synced_slot is called when the synced slot
is in temporary state and we are calling function
'update_local_synced_slot' directly for permanent slots.Slot sync
skip when "remote_slot_precedes" is true can happen for both permanent
and temporary slot. So I think we need to update the stats in
"update_local_synced_slot"
Whereas we are skipping slot sync only for temporary slots when a
consistent snapshot is not found. So I added this in the function
"update_and_persist_local_synced_slot".
> > 4)
> > What about the case where the slot is invalidated and sync is skipped?
> > I do not see any stats for that. See 'Skip the sync of an invalidated
> > slot' in synchronize_one_slot(). If it is already discussed and
> > concluded, please add a comment.
> >
It was not discussed earlier.
The pg_replication_slots already have a column name
'invalidation_reason'. And when the remote slot is invalidated the
local slot's is also invalidated. So, should we be required to
maintain this in 'slotsync_skip_reason' as well? I think it would be
kind of redundant? Thoughts?
>
> Few more on 001:
>
> 5)
> The name SS_SKIP_MISSING_WAL_RECORD doesn’t seem appropriate. It
> sounds more like some WAL issue, rather than indicating that the WAL
> hasn’t been flushed. A better name could be SS_SKIP_WAL_NOT_FLUSHED.
>
I think that the suggested name is better. I have updated the patch accordingly.
> 6)
> Instead of calling 'update_slot_sync_skip_stats' at multiple places,
> how about we just update the skip_reason everywhere and make a call to
> 'update_slot_sync_skip_stats' only in synchronize_one_slot(). IMO,
> that will look cleaner. Thoughts?
>
I tried this approach in [1] (see v2_approach2). This approach would
require passing extra parameters to the functions.
Here Amit suggested that we should try an approach where this can be
avoided. So, I came up with the current approach. See [2].
I have addressed the comments and attached the updated v7 patch.
[1]: https://www.postgresql.org/message-id/CANhcyEXHcdoRRo0N0uib-t7mfkbotv%3DaYjAWAekDAbHCRe%2BBng%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/CAA4eK1KZLPxv7VBZf%3DBp9%3D-pzKNfvNmFDqFYYzwkowE4FpRs1A%40mail.gmail.com
Thanks,
Shlok Kyal
| Attachment | Content-Type | Size |
|---|---|---|
| v7-0001-Add-stats-related-to-slot-sync-skip.patch | application/octet-stream | 25.6 KB |
| v7-0002-Add-test-for-new-stats-for-slot-sync-skip.patch | application/octet-stream | 7.7 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Dagfinn Ilmari Mannsåker | 2025-11-12 10:55:47 | Re: [Patch] Windows relation extension failure at 2GB and 4GB |
| Previous Message | Yugo Nagata | 2025-11-12 10:49:36 | Re: Make PQgetResult() not return NULL on out-of-memory error |