| From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
|---|---|
| To: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
| Subject: | Re: Fix slotsync worker busy loop causing repeated log messages |
| Date: | 2026-04-10 09:52:34 |
| Message-ID: | CAA4eK1++dGmvReJ4pRw09qdrUED=iUfyiEM-fN03D2VVK1de1A@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Fri, Apr 10, 2026 at 8:28 AM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Sunday, March 22, 2026 1:47 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > On Thu, Mar 19, 2026 at 7:08 PM Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
> > wrote:
> > >
> > > On Sat, Feb 28, 2026 at 2:03 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> > wrote:
> > > >
> > > > On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
> > wrote:
> > > > >
> > > > > Normally, the slotsync worker updates the standby slot using the
> > > > > primary's slot state. However, when confirmed_flush_lsn matches
> > > > > but restart_lsn does not, the worker does not actually update the
> > > > > standby slot. Despite that, the current code of
> > > > > update_local_synced_slot() appears to treat this situation as if
> > > > > an update occurred. As a result, the worker sleeps only for the
> > > > > minimum interval (200 ms) before retrying. In the next cycle, it
> > > > > again assumes an update happened, and continues looping with the
> > > > > short sleep interval, causing the repeated logical decoding log messages.
> > Based on a quick analysis, this seems to be the root cause.
> > > > >
> > > > > I think update_local_synced_slot() should return false (i.e., no
> > > > > update
> > > > > happened) when confirmed_flush_lsn is equal but restart_lsn
> > > > > differs between primary and standby.
> > > > >
> > > >
> > > > We expect that in such a case update_local_synced_slot() should
> > > > advance local_slot's 'restart_lsn' via
> > > > LogicalSlotAdvanceAndCheckSnapState(), otherwise, it won't go in the
> > > > cheap code path next time. Normally, restart_lsn advancement should
> > > > happen when we process XLOG_RUNNING_XACTS and call
> > > > SnapBuildProcessRunningXacts(). In this particular case as both
> > > > restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
> > > > machinery may not be processing XLOG_RUNNING_XACTS record. I have
> > > > not debugged the exact case yet but you can try by emitting some
> > > > more records on publisher, it should let the standby advance the
> > > > slot. It is possible that we can do something like you are proposing
> > > > to silence the LOG messages but we should know what is going on here.
> > >
> > > I agree it's important to investigate why restart_lsn doesn't advance
> > > in that case and fix that issue.
> > >
> > > Separately, I think mistreating the slot as updated is another problem:
> > > the local slot is treated as updated even when it isn't, which causes
> > > the slotsync worker to use the minimum sleep interval. We should
> > > address this independently.
> > >
> > > LogicalSlotAdvanceAndCheckSnapState() doesn't always update the local
> > > slot, but update_local_synced_slot() currently assumes it does. That
> > > seems to be the root cause. The attached 0001 patch makes
> > > LogicalSlotAdvanceAndCheckSnapState() return whether the slot was
> > > actually updated, and updates update_local_synced_slot() to act based on
> > that result.
> > >
> > > I've also attached 0002, which reduces the log level of some logical
> > > decoding messages (e.g., "logical decoding found consistent point")
> > > from LOG to DEBUG1. These are low-level, developer-oriented messages
> > > and can be noisy when they occur regularly. This is the same issue I
> > reported earlier at [1].
> > >
> > > How about applying these patches while we continue working on the
> > > restart_lsn issue?
> > >
> >
> > *
> > @@ -2174,7 +2193,10 @@
> > LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,
> >
> > if (XLogRecPtrIsValid(ctx->reader->EndRecPtr))
> > {
> > - LogicalConfirmReceivedLocation(moveto);
> > + bool slot_updated = LogicalConfirmReceivedLocation(moveto);
> > +
> > + if (updated_xmin_or_lsn)
> > + *updated_xmin_or_lsn = slot_updated;
> >
> > BTW, LogicalSlotAdvanceAndCheckSnapState() could also advance slot while
> > processing running_xact record, so not sure we can rely only on the explicit
> > call LogicalConfirmReceivedLocation() above to ascertain the same.
>
> Perhaps we could simply compare the slot's old and new LSN/xmin to determine
> whether updated_xmin_or_lsn needs to be set.
>
The patch works for me. Can we change the comment to as follows: "It
is possible that the slot's xmin or LSNs are not updated, when the
synced slot has reached consistent snapshot state or cannot build the
one at all."?
--
With Regards,
Amit Kapila.
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Amit Kapila | 2026-04-10 09:54:55 | Re: pgsql: Reduce log level of some logical decoding messages from LOG to D |
| Previous Message | Tatsuo Ishii | 2026-04-10 09:35:15 | Re: Row pattern recognition |