RE: Fix slotsync worker busy loop causing repeated log messages

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Fix slotsync worker busy loop causing repeated log messages
Date: 2026-03-03 12:24:33
Message-ID: OS7PR01MB169097E45A7A84AF631534306947FA@OS7PR01MB16909.jpnprd01.prod.outlook.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tuesday, March 3, 2026 6:25 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Tue, Mar 3, 2026 at 1:12 PM Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com>
> wrote:
> >
> > On Saturday, February 28, 2026 1:03 PM Amit Kapila
> <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
> wrote:
> > > >
> > > > Normally, the slotsync worker updates the standby slot using the
> > > > primary's slot state. However, when confirmed_flush_lsn matches
> > > > but restart_lsn does not, the worker does not actually update the
> > > > standby slot. Despite that, the current code of
> > > > update_local_synced_slot() appears to treat this situation as if
> > > > an update occurred. As a result, the worker sleeps only for the
> > > > minimum interval (200 ms) before retrying. In the next cycle, it
> > > > again assumes an update happened, and continues looping with the
> > > > short sleep interval, causing the repeated logical decoding log
> > > > messages. Based on a quick analysis, this seems to be
> > > the root cause.
> > > >
> > > > I think update_local_synced_slot() should return false (i.e., no
> > > > update
> > > > happened) when confirmed_flush_lsn is equal but restart_lsn
> > > > differs between primary and standby.
> > > >
> > >
> > > We expect that in such a case update_local_synced_slot() should
> > > advance local_slot's 'restart_lsn' via
> > > LogicalSlotAdvanceAndCheckSnapState(),
> > > otherwise, it won't go in the cheap code path next time. Normally,
> > > restart_lsn advancement should happen when we process
> > > XLOG_RUNNING_XACTS and call SnapBuildProcessRunningXacts(). In this
> > > particular case as both restart_lsn and confirmed_flush_lsn are the
> > > same (0/03000140), the machinery may not be processing
> > > XLOG_RUNNING_XACTS record. I have not debugged the exact case yet
> > > but you can try by emitting some more records on publisher, it
> > > should let the standby advance the slot. It is possible that we can
> > > do something like you are proposing to silence the LOG messages but we
> should know what is going on here.
> >
> > I reproduced and debugged this issue where a replication slot's
> > restart_lsn fails to advance. In my environment, I found it only
> > occurs when a synced slot first builds a consistent snapshot. The
> > problematic code path is in
> > SnapBuildProcessRunningXacts():
> >
> > if (builder->state < SNAPBUILD_CONSISTENT)
> > {
> > /* returns false if there's no point in performing cleanup just yet */
> > if (!SnapBuildFindSnapshot(builder, lsn, running))
> > return;
> > }
> >
> > When a synced slot reaches consistency for the first time with no
> > running transactions, SnapBuildFindSnapshot() returns false, causing
> > the function to return without updating the candidate restart_lsn.
> >
> > So, an alternative approach is to improve this logic by updating the
> > candidate restart_lsn in this case instead of returning early.
> >
>
> But why not return 'true' from SnapBuildFindSnapshot() in that case?
> The comment atop SnapBuildFindSnapshot() says: "Returns true if there is a
> point in performing internal maintenance/cleanup using the xl_running_xacts
> record.". Doesn't updating restart_lsn fall under that category?

After re-thinking the return value semantics, I realized that
SnapBuildFindSnapshot() may no longer need a return value. In the revised
logic, the function would always return true, making the return value
redundant.

To elaborate: the original SnapBuildFindSnapshot() returned:
- true: when snapshot is built incrementally (after waiting for running
transactions to complete)
- false: when snapshot is built immediately (restoring from disk or no
running transactions)

My initial thought was to use this return value to decide whether to advance
restart_lsn to the latest LSN (advance when restoring a snapshot or with no running
transactions). However, this decision can be made without relying on the
return value (please see the v2 patch).

Therefore, we can change the function return type to void. The only other
consumer of this return value is SnapBuildPurgeOlderTxn(), which can instead
skip cleanup when no transaction data has accumulated (i.e., when the number
of committed transactions is zero), eliminating the need for the return value
altogether.

> However, I have a question that even if we haven't incremented it in the first
> cycle, why is it not incrementing restart_lsn in consecutive sync cycles.

Because no walsender was active during the reproduction step, so the slot remained
inactive on the publisher and its restart_lsn didn't advanced. As a result, the
slotsync process stalled while continuously retrying the first cycle of
snapshot building.

Best Regards,
Hou zj

Attachment Content-Type Size
v2-0001-Advance-restart_lsn-when-reaching-consistency-wit.patch application/octet-stream 6.8 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2026-03-03 13:45:07 Re: POC: PLpgSQL FOREACH IN JSON ARRAY
Previous Message Álvaro Herrera 2026-03-03 12:12:38 Re: Cleanup shadows variable warnings, round 1