Re: Fix slotsync worker busy loop causing repeated log messages

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Fix slotsync worker busy loop causing repeated log messages
Date: 2026-04-10 06:01:14
Message-ID: CAJpy0uARs-jVYuPU6Wg3jTYFZiYEKMx30CPBM9XXvNnF=4xqeA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Apr 10, 2026 at 8:28 AM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Sunday, March 22, 2026 1:47 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > On Thu, Mar 19, 2026 at 7:08 PM Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
> > wrote:
> > >
> > > On Sat, Feb 28, 2026 at 2:03 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> > wrote:
> > > >
> > > > On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
> > wrote:
> > > > >
> > > > > Normally, the slotsync worker updates the standby slot using the
> > > > > primary's slot state. However, when confirmed_flush_lsn matches
> > > > > but restart_lsn does not, the worker does not actually update the
> > > > > standby slot. Despite that, the current code of
> > > > > update_local_synced_slot() appears to treat this situation as if
> > > > > an update occurred. As a result, the worker sleeps only for the
> > > > > minimum interval (200 ms) before retrying. In the next cycle, it
> > > > > again assumes an update happened, and continues looping with the
> > > > > short sleep interval, causing the repeated logical decoding log messages.
> > Based on a quick analysis, this seems to be the root cause.
> > > > >
> > > > > I think update_local_synced_slot() should return false (i.e., no
> > > > > update
> > > > > happened) when confirmed_flush_lsn is equal but restart_lsn
> > > > > differs between primary and standby.
> > > > >
> > > >
> > > > We expect that in such a case update_local_synced_slot() should
> > > > advance local_slot's 'restart_lsn' via
> > > > LogicalSlotAdvanceAndCheckSnapState(), otherwise, it won't go in the
> > > > cheap code path next time. Normally, restart_lsn advancement should
> > > > happen when we process XLOG_RUNNING_XACTS and call
> > > > SnapBuildProcessRunningXacts(). In this particular case as both
> > > > restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
> > > > machinery may not be processing XLOG_RUNNING_XACTS record. I have
> > > > not debugged the exact case yet but you can try by emitting some
> > > > more records on publisher, it should let the standby advance the
> > > > slot. It is possible that we can do something like you are proposing
> > > > to silence the LOG messages but we should know what is going on here.
> > >
> > > I agree it's important to investigate why restart_lsn doesn't advance
> > > in that case and fix that issue.
> > >
> > > Separately, I think mistreating the slot as updated is another problem:
> > > the local slot is treated as updated even when it isn't, which causes
> > > the slotsync worker to use the minimum sleep interval. We should
> > > address this independently.
> > >
> > > LogicalSlotAdvanceAndCheckSnapState() doesn't always update the local
> > > slot, but update_local_synced_slot() currently assumes it does. That
> > > seems to be the root cause. The attached 0001 patch makes
> > > LogicalSlotAdvanceAndCheckSnapState() return whether the slot was
> > > actually updated, and updates update_local_synced_slot() to act based on
> > that result.
> > >
> > > I've also attached 0002, which reduces the log level of some logical
> > > decoding messages (e.g., "logical decoding found consistent point")
> > > from LOG to DEBUG1. These are low-level, developer-oriented messages
> > > and can be noisy when they occur regularly. This is the same issue I
> > reported earlier at [1].
> > >
> > > How about applying these patches while we continue working on the
> > > restart_lsn issue?
> > >
> >
> > *
> > @@ -2174,7 +2193,10 @@
> > LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,
> >
> > if (XLogRecPtrIsValid(ctx->reader->EndRecPtr))
> > {
> > - LogicalConfirmReceivedLocation(moveto);
> > + bool slot_updated = LogicalConfirmReceivedLocation(moveto);
> > +
> > + if (updated_xmin_or_lsn)
> > + *updated_xmin_or_lsn = slot_updated;
> >
> > BTW, LogicalSlotAdvanceAndCheckSnapState() could also advance slot while
> > processing running_xact record, so not sure we can rely only on the explicit
> > call LogicalConfirmReceivedLocation() above to ascertain the same.
>
> Perhaps we could simply compare the slot's old and new LSN/xmin to determine
> whether updated_xmin_or_lsn needs to be set. This would avoid the need to hook
> into each individual update path at the lower level. Attaching a patch for reference.
>
> >
> > Sorry, I couldn't get the chance to look at the patches proposed by Hou-san to
> > fix this issue but I'll look at it after the feature freeze.
>
> The other patches attempt to advance restart_lsn in more cases by modifying the
> logical decoding logic. However, those scenarios are relatively rare and may not
> produce significant improvement outside the slot synchronization context. So, I
> think modifying the slot sync worker to correctly increase the polling interval instead
> is also a good starting point.
>

After reviewing the entire thread and patches, the above solution
appears to be the simplest and least intrusive, especially since the
issue is limited to the sltosync worker. So +1 for this idea.

thanks
Shveta

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Roman Khapov 2026-04-10 06:05:08 Re: Additional message in pg_terminate_backend
Previous Message Chao Li 2026-04-10 05:53:15 Fix pgstat_database.c to honor passed database OIDs