RE: Fix slotsync worker busy loop causing repeated log messages

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Fix slotsync worker busy loop causing repeated log messages
Date: 2026-03-03 07:42:31
Message-ID: OS7PR01MB16909C13530D84781E7C2E2EF947FA@OS7PR01MB16909.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Saturday, February 28, 2026 1:03 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> >
> > Normally, the slotsync worker updates the standby slot using the
> > primary's slot state. However, when confirmed_flush_lsn matches but
> > restart_lsn does not, the worker does not actually update the standby
> > slot. Despite that, the current code of update_local_synced_slot()
> > appears to treat this situation as if an update occurred. As a result,
> > the worker sleeps only for the minimum interval (200 ms) before
> > retrying. In the next cycle, it again assumes an update happened, and
> > continues looping with the short sleep interval, causing the repeated
> > logical decoding log messages. Based on a quick analysis, this seems to be
> > the root cause.
> >
> > I think update_local_synced_slot() should return false (i.e., no update
> > happened) when confirmed_flush_lsn is equal but restart_lsn differs
> > between primary and standby.
> >
>
> We expect that in such a case update_local_synced_slot() should advance
> local_slot's 'restart_lsn' via LogicalSlotAdvanceAndCheckSnapState(),
> otherwise, it won't go in the cheap code path next time. Normally, restart_lsn
> advancement should happen when we process XLOG_RUNNING_XACTS and
> call SnapBuildProcessRunningXacts(). In this particular case as both
> restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
> machinery may not be processing XLOG_RUNNING_XACTS record. I have not
> debugged the exact case yet but you can try by emitting some more records
> on publisher, it should let the standby advance the slot. It is possible that we
> can do something like you are proposing to silence the LOG messages but we
> should know what is going on here.
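
For reference, the sleep behavior described above can be modeled roughly as
follows. This is only a sketch, not the slotsync worker's code:
next_naptime_ms() and naptime_after_cycles() are hypothetical names, and the
doubling backoff and the 30 s cap are assumptions made for illustration. Only
the 200 ms minimum is taken from the report.

```c
#include <stdbool.h>

/* 200 ms is the minimum interval mentioned in the report. */
#define MIN_NAPTIME_MS 200L
/* Assumed upper bound for the backoff; illustrative only. */
#define MAX_NAPTIME_MS 30000L

/*
 * Hypothetical helper: pick the next sleep interval from the current one and
 * from whether the last sync cycle reported that some slot was updated.
 */
static long
next_naptime_ms(long cur_ms, bool slot_updated)
{
	/* A reported update resets the worker to the shortest sleep. */
	if (slot_updated)
		return MIN_NAPTIME_MS;

	/* No update: back off (assumed: double the interval, up to the cap). */
	if (cur_ms * 2 > MAX_NAPTIME_MS)
		return MAX_NAPTIME_MS;
	return cur_ms * 2;
}

/* Run ncycles sync cycles that all report the same result. */
static long
naptime_after_cycles(int ncycles, bool slot_updated)
{
	long		nap = MIN_NAPTIME_MS;

	for (int i = 0; i < ncycles; i++)
		nap = next_naptime_ms(nap, slot_updated);
	return nap;
}
```

In this model, a cycle that always reports an update keeps the interval pinned
at 200 ms no matter how many cycles run, which matches the repeated log
messages in the report. A cycle that reports no update lets the interval grow.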

I reproduced and debugged this issue where a replication slot's restart_lsn
fails to advance. In my environment, I found it only occurs when a synced
slot first builds a consistent snapshot. The problematic code path is in
SnapBuildProcessRunningXacts():

	if (builder->state < SNAPBUILD_CONSISTENT)
	{
		/* returns false if there's no point in performing cleanup just yet */
		if (!SnapBuildFindSnapshot(builder, lsn, running))
			return;
	}

When a synced slot reaches consistency for the first time with no running
transactions, SnapBuildFindSnapshot() returns false, causing the function to
return without updating the candidate restart_lsn.

So an alternative approach is to update the candidate restart_lsn in this
case instead of returning early; see the attached patch for details. This
fixes the issue on my machine.
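
To make the difference concrete, here is a small standalone model of the early
return. It is a simplified sketch and not the patch itself: every Toy*/toy_*
name is hypothetical, the real functions do much more (cleanup, snapshot
serialization, choosing which LSN becomes the candidate), and the
advance_on_first_consistency flag only reflects the idea described above.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyLSN;

typedef enum
{
	TOY_START,
	TOY_CONSISTENT
} ToyState;

typedef struct ToyBuilder
{
	ToyState	state;
	ToyLSN		candidate_restart_lsn;	/* 0 means "not set" */
} ToyBuilder;

/*
 * Stand-in for SnapBuildFindSnapshot(): with no running transactions the
 * builder jumps straight to consistent and returns false.  The other paths
 * are collapsed into "return true" for brevity.
 */
static bool
toy_find_snapshot(ToyBuilder *b, int nrunning)
{
	if (nrunning == 0)
	{
		b->state = TOY_CONSISTENT;
		return false;
	}
	return true;
}

/* Stand-in for SnapBuildProcessRunningXacts(). */
static void
toy_process_running_xacts(ToyBuilder *b, ToyLSN lsn, int nrunning,
						  bool advance_on_first_consistency)
{
	if (b->state < TOY_CONSISTENT)
	{
		if (!toy_find_snapshot(b, nrunning))
		{
			/*
			 * Current behavior: return here, so no candidate is recorded.
			 * Assumed alternative: keep going if we just became consistent.
			 */
			if (!(advance_on_first_consistency &&
				  b->state == TOY_CONSISTENT))
				return;
		}
	}

	/* Simplified: record this record's LSN as the candidate restart_lsn. */
	b->candidate_restart_lsn = lsn;
}

/* Process one running-xacts record with no running transactions. */
static ToyLSN
toy_candidate_after_first_record(bool advance_on_first_consistency)
{
	ToyBuilder	b = {TOY_START, 0};

	toy_process_running_xacts(&b, 0x03000140, 0, advance_on_first_consistency);
	return b.candidate_restart_lsn;
}
```

With the flag off, the first record that makes the builder consistent leaves
the candidate unset, so nothing can advance restart_lsn from that record. With
the flag on, the same record sets the candidate.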

Best Regards,
Hou zj

Attachment: v1-0001-Advance-restart_lsn-when-reaching-consistency-wit.patch (application/octet-stream, 3.5 KB)
