| From: | Xuneng Zhou <xunengzhou(at)gmail(dot)com> |
|---|---|
| To: | Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru> |
| Cc: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | Re: Deadlock detector fails to activate on a hot standby replica |
| Date: | 2026-07-02 09:31:33 |
| Message-ID: | CABPTF7VadASm0DaqS=npOF3kv5-yWkzt2m=3ziXyH+mzbRqLjQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Wed, Jul 1, 2026 at 11:58 PM Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru> wrote:
>
> Dear Xuneng Zhou
>
> Thank you very much for such a comprehensive feedback.
>
> On 6/30/26 06:14, Xuneng Zhou wrote:
>
> 3)
>
> > Another potential issue of doing safe checks here is the order of
> > assert and error out:
> > In assert build, if the condition is true, then the header lock and
> > content lock would not be released properly. So we may need to adjust
> > the order.
>
> > This leads me to wonder -- would it be better to let this help
> > function to do less, like just for registration and let the callers to
> > handle others.
>
> I agree with your comment. Furthermore, I do not like the approach with
> locking and unlocking in different functions. But, I see this approach
> is used in some other places in the code. I should think how to improve it.
>
> I have some doubts about the code where LockBufferForCleanup (bufmgr.c)
> calls ResolveRecoveryConflictWithBufferPin (standby.c), but the latter
> function calls BufferIsReadyForCleanup (bufmgr.c). There is an idea to
> refactor it in the future, because these functions are closely coupled.
> May be unite them or move ResolveRecoveryConflictWithBufferPin into
> bufmgr.c...
I agree that letting the resolver own the wait for loop is not very
ideal. The ideal design would be the code that owns the wait predicate
owns the wait loop(bufmgr). But there seems no easy way to do it.
> 4) the already-expired standby-delay path may not wake new pin holders
>
> Agree, thank you. It seems it is a kind of an optimization to avoid
> activation of timeouts. Is it enought to return after
> SendRecoveryConflictWithBufferPin, once we sent a signal to a conflicting
> backend? Not sure, we have to wait with ProcWaitForSignal. Of course, the
> same check should be in the loop, when timeouts are already activated. If
> another backend pins the buffer after the broadcast, it will be handled
> in the parent function.
Yeah, returning right away after the initial
SendRecoveryConflictWithBufferPin() is a possible way to fix the
issue, but it may cause repeated broadcasts / busy retrying with the
current design.
> 5) The name of BufferIsReadyForCleanup seems broader than the actual protocol.
>
> What about to move RegisterPinCountWaiter out of this function? May be, rename
> it into BufferGetRefCount() or something else like it? In this case, its
> semantics will be simple and well-defined, and it can be re-used in some other
> places. Another thing is to add a secong function argument like
> doRegisterWaiter.
>
> We may also remove BufferIsReadyForCleanup but put its code into
> ResolveRecoveryConflictWithBufferPin.
>
> I have no good solution right now. Let me please think about it.
>
> 6)
>
> > From personal experience, the message seems a bit hard to follow if
> > one does not read the related discussion. I like your analysis of the
> > problem at the start of this thread [1]. Would it be sensible to
> > summarize and reuse some part of it for the message?
>
> > The motivation of this patch is to fix deadlock detector activation,
> > but the wait-loop change has a slightly broader benefit. It also
> > improves the standby timeout path indirectly by avoiding needless
> > timeout teardown and re-entry on unrelated wakeups, and by keeping any
> > already-fired timeout state until the loop handles it. Should we
> > mention this as an additional benefit, or even frame the message
> > broader like:
> > Fix the buffer-pin recovery-conflict wait loop.
>
> Thank you for such a detailed comment. I would like to propose the following
> commit message:
This version seems better to me in general.
> Fix recovery-conflict wait loop for buffer pins on hot standby
>
> When a deadlock occurs between the startup process and a backend during
> recovery on a hot standby, the deadlock detector may fail to activate if
> deadlock_timeout exceeds log_startup_progress_interval. Due to an
> optimization in timeout.c that avoids redundant setitimer() calls,
> periodic SIGALRM signals from the progress timer can arrive before the
> deadlock timeout expires. These signals wake the startup process
> prematurely, causing ResolveRecoveryConflictWithBufferPin to reset all
> timeouts and retry without ever triggering the deadlock detector. This
> results in an infinite loop in LockBufferForCleanup and stalls recovery
> indefinitely, particularly when max_standby_streaming_delay = -1.
>
> The timeout.c subsystem uses an optimization that avoids redundant
> setitimer() calls when the nearest active timeout's expiration time is
> later than the currently armed timer. As a result, a SIGALRM from a
> previously set timer (e.g., log_startup_progress_interval) can fire when
> no currently active timeout has actually been reached. This periodic
> signal wakes ResolveRecoveryConflictWithBufferPin prematurely, resetting
> the timeout loop without ever triggering the deadlock detector at
> deadlock_timeout. The deadlock timeout is perpetually re-armed but never
> fires because each periodic SIGALRM arrives before it expires.
These two paras seem repetitive to state the same thing. We might want
to consolidate them. Another thing is should we focus less on the
specific issue of log_startup_progress_interval? It's one of the
sources that generate unrelated wake-ups that lead to this issue. The
deeper issue is the buffer-pin wait code treats any latch wakeup as if
it were a relevant buffer-pin wait event.
> This patch fixes the issue by making ResolveRecoveryConflictWithBufferPin
> ignore unintended SIGALRM signals that do not correspond to any currently
> enabled timeout, ensuring the deadlock timeout fires correctly and the
> conflicting backend is signaled to check for deadlocks.
Unintended SIGALRM signals are the thing we try to ignore. It looks
more like the end rather than the means. What we do seems broader: we
ignore all unrelated latch wake-ups and only react to the genuine
ready signal we care about.
> It improves the standby timeout path by reducing unnecessary timeout
> teardown and re-arming on unrelated wakeups, preserving the activated
> timer state until the wait loop completes. This makes the deadlock
> timeout behavior consistent with user expectations.
>
>
> It seems, the patch should be completely rewritten. I will try to
> prepare a new patch version.
Thanks for that.
--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Chao Li | 2026-07-02 09:34:28 | Re: Escape CR/LF in invalid database, role, and tablespace name errors |
| Previous Message | Jakub Wartak | 2026-07-02 09:24:21 | Re: Adding basic NUMA awareness |