| From: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
|---|---|
| To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Fix slotsync worker busy loop causing repeated log messages |
| Date: | 2026-03-19 13:38:33 |
| Message-ID: | CAHGQGwG1br8SQQnBieyJeRAmnMejEtJrVmurfTPxNouLHSW=jA@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Sat, Feb 28, 2026 at 2:03 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> >
> > Normally, the slotsync worker updates the standby slot using the primary's slot
> > state. However, when confirmed_flush_lsn matches but restart_lsn does not,
> > the worker does not actually update the standby slot. Despite that, the current
> > code of update_local_synced_slot() appears to treat this situation as if
> > an update occurred. As a result, the worker sleeps only for the minimum
> > interval (200 ms) before retrying. In the next cycle, it again assumes
> > an update happened, and continues looping with the short sleep interval,
> > causing the repeated logical decoding log messages. Based on a quick analysis,
> > this seems to be the root cause.
> >
> > I think update_local_synced_slot() should return false (i.e., no update
> > happened) when confirmed_flush_lsn is equal but restart_lsn differs between
> > primary and standby.
> >
>
> We expect that in such a case update_local_synced_slot() should
> advance local_slot's 'restart_lsn' via
> LogicalSlotAdvanceAndCheckSnapState(), otherwise, it won't go in the
> cheap code path next time. Normally, restart_lsn advancement should
> happen when we process XLOG_RUNNING_XACTS and call
> SnapBuildProcessRunningXacts(). In this particular case as both
> restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
> machinery may not be processing XLOG_RUNNING_XACTS record. I have not
> debugged the exact case yet but you can try by emitting some more
> records on publisher, it should let the standby advance the slot. It
> is possible that we can do something like you are proposing to silence
> the LOG messages but we should know what is going on here.
I agree it's important to investigate why restart_lsn doesn't advance
in that case and fix that issue.
Separately, I think mistreating the slot as updated is another problem:
the local slot is treated as updated even when it isn't, which causes
the slotsync worker to use the minimum sleep interval. We should
address this independently.
LogicalSlotAdvanceAndCheckSnapState() doesn't always update the local slot,
but update_local_synced_slot() currently assumes it does. That seems to be
the root cause. The attached 0001 patch makes
LogicalSlotAdvanceAndCheckSnapState() return whether the slot was actually
updated, and updates update_local_synced_slot() to act based on that result.
I've also attached 0002, which reduces the log level of some logical decoding
messages (e.g., "logical decoding found consistent point") from LOG to
DEBUG1. These are low-level, developer-oriented messages and can be noisy
when they occur regularly. This is the same issue I reported earlier at [1].
How about applying these patches while we continue working on
the restart_lsn issue?
Regards,
[1] https://postgr.es/m/CAHGQGwED5kJxZ_MdCh=WLa5M7ekXdnzCO1ZCQhQNCEdfaEPwFQ@mail.gmail.com
--
Fujii Masao
| Attachment | Content-Type | Size |
|---|---|---|
| v2-0001-Fix-slotsync-worker-busy-loop-causing-repeated-lo.patch | application/octet-stream | 6.6 KB |
| v2-0002-Reduce-log-level-of-some-logical-decoding-message.patch | application/octet-stream | 3.3 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Daniel Gustafsson | 2026-03-19 13:49:21 | Re: Changing the state of data checksums in a running cluster |
| Previous Message | Robert Haas | 2026-03-19 13:36:36 | Re: Better shared data structure management and resizable shared data structures |