Deadlock detector fails to activate on a hot standby replica

From: Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Deadlock detector fails to activate on a hot standby replica
Date: 2026-01-19 12:43:16
Message-ID: 44c24dcf-5710-410f-b1b6-d10b315f3d51@postgrespro.ru
Lists: pgsql-hackers

Dear Hackers,

The deadlock detection mechanism fails to activate when a deadlock occurs
between the startup process and a backend process on a hot standby replica,
resulting in unexpected delays in recovery. The deadlock may happen when
processing XLOG_HEAP2_PRUNE_* records. The deadlock is still resolved
automatically once max_standby_streaming_delay is reached, if it is set, but
this value is sometimes set to -1, which disables the timeout. The issue
reproduces consistently in version 15 and later, where
log_startup_progress_interval was introduced.

The startup process notifies the conflicting backend process to check for
deadlocks when deadlock_timeout is reached. This works in general, but not in
some scenarios. If deadlock_timeout is set greater than
log_startup_progress_interval, the deadlock detector is never triggered, and
the startup process waits for the deadlock to be resolved until
max_standby_streaming_delay is reached (if it is set).

It is reproducible with the attached TAP test 900_startup_backend_deadlock.pl [1].
To reproduce, just copy this test into src/test/recovery/t and run it.

The problem seems to be either in the timeout.c functionality or in
ResolveRecoveryConflictWithBufferPin, depending on how one understands the
semantics of the timeout API. The root cause: handle_sig_alarm (the SIGALRM
handler) may be called when no active timeout has been reached. It sets the
process latch unconditionally, thus waking up the process.

The problem may be in an optimization where setitimer is not called if the
nearest expiry time among the active timeouts is later than the time already
set in the timer. As a result, the SIGALRM handler may be called when no active
timeout has been reached.
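
To make the optimization above concrete, here is a toy model in Python. It is
only a sketch of the behavior as described in this message, not the timeout.c
code itself. The function name needs_setitimer and its parameters are
hypothetical, and all times are milliseconds on a simulated clock.

```python
def needs_setitimer(signal_pending, signal_due_at, nearest_fin_time):
    """Return True if the model would re-arm the interval timer.

    Assumption: the timer is left untouched when an interrupt is already
    pending and is due no later than the nearest active timeout.
    """
    if signal_pending and nearest_fin_time >= signal_due_at:
        return False
    return True


# The progress timeout was armed at t=0 for 1000 ms and then disabled
# without cancelling the timer: an interrupt is still pending for t=1000.
pending, due_at = True, 1000

# Enabling a 3000 ms deadlock timeout at t=0 does not re-arm the timer, so
# the first SIGALRM arrives at t=1000, before any active timeout expires.
rearm_for_long_timeout = needs_setitimer(pending, due_at, 0 + 3000)

# A 500 ms timeout expires before the pending interrupt, so the timer
# is re-armed in that case.
rearm_for_short_timeout = needs_setitimer(pending, due_at, 0 + 500)

print(rearm_for_long_timeout, rearm_for_short_timeout)
```

In this model only the first case produces a wakeup with no expired timeout,
which is the situation the scenario below relies on.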

Below is the scenario in which the deadlock timeout is not activated:

(1) The startup process sets startup_progress_interval to 1000ms and continues
with the recovery of the received WAL.

(2) When processing XLOG_HEAP2_PRUNE_*, the startup process tries to lock the
buffer using LockBufferForCleanup, which calls
ResolveRecoveryConflictWithBufferPin. A deadlock between the startup and
backend processes is possible (see the
src/test/recovery/t/031_recovery_conflict.pl test). Imagine that we end up in
such a deadlock.

(3) ResolveRecoveryConflictWithBufferPin sets the deadlock timeout to 3000 ms
and waits in ProcWaitForSignal either for the buffer pin to be released or for
the timeout.

(4) While the startup process is in ProcWaitForSignal, handle_sig_alarm is called
because startup_progress_interval is reached (the timeout was disabled, but
the real timer was not reset). It sets the process latch unconditionally and
reschedules the timer: the currently active timeout will fire in ~2000 ms in
our case, if XLOG_HEAP2_PRUNE_* was received right after step (1).
This means that the next call of handle_sig_alarm will happen in 2000 ms.

(5) ResolveRecoveryConflictWithBufferPin continues after ProcWaitForSignal,
disables all active timeouts and returns. LockBufferForCleanup sees that the
buffer is still locked and calls ResolveRecoveryConflictWithBufferPin again.

(6) ResolveRecoveryConflictWithBufferPin sets the deadlock timeout to 3000 ms, but
the real timer is not changed: it will still fire in 2000 ms. Then it
waits for the timeout in ProcWaitForSignal.

(7) The SIGALRM handler (handle_sig_alarm) is called after 2000 ms. It sets the
process latch, but the deadlock timeout is not yet reached. Since it is not
reached, the startup process does not signal the conflicting backend to check
for deadlocks. ResolveRecoveryConflictWithBufferPin resets all timeouts again
and returns control to LockBufferForCleanup. The buffer is still locked, so it
calls ResolveRecoveryConflictWithBufferPin again.

(8) And so on... The startup process will run forever. It will loop in
LockBufferForCleanup without any progress in recovery.

The problem is this: if an unexpected SIGALRM is received before the deadlock
timeout, it can lead to an infinite loop in LockBufferForCleanup.
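
The steps above can be replayed with a small discrete-time simulation. This is
a sketch under the assumptions stated in this message (a stale timer left by
the progress timeout, setitimer skipped when the pending interrupt is not
later than the new timeout, the latch set on every SIGALRM). The function
simulate and its parameters are hypothetical names, not PostgreSQL code.

```python
def simulate(progress_ms, deadlock_ms, max_wakeups=50):
    """Return the simulated time (ms) at which the deadlock timeout fires,
    or None if it never fires within max_wakeups wakeups."""
    now = 0
    # Step (1): the timer is armed for the progress timeout. The timeout is
    # later disabled, but the timer itself stays armed.
    timer_due = progress_ms
    for _ in range(max_wakeups):
        # Steps (3)/(6): the deadlock timeout is enabled relative to "now".
        deadlock_fin = now + deadlock_ms
        if deadlock_fin < timer_due:
            timer_due = deadlock_fin  # timer re-armed only if it must fire earlier
        # Steps (4)/(7): SIGALRM arrives and the latch is set unconditionally.
        now = timer_due
        if deadlock_fin <= now:
            return now  # deadlock timeout reached: the backend would be signalled
        # The handler re-arms the timer for the still-active deadlock timeout.
        timer_due = deadlock_fin
        # Step (5): the caller wakes up, disables all timeouts (the timer is
        # left untouched) and retries from the top of the loop.
    return None


print(simulate(1000, 3000))  # deadlock_timeout > progress interval
print(simulate(1000, 500))   # deadlock_timeout < progress interval
```

With a 1000 ms progress interval the model never reaches a 3000 ms deadlock
timeout: each wakeup happens at the expiry time computed by the previous
attempt, which is always earlier than the newly computed one. A 500 ms deadlock
timeout fires at t=500 as expected.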

I see a few possible solutions:
1. Call setitimer every time it is needed (see the demo patch [2]).
2. Redesign the LockBufferForCleanup logic to handle the cases when SIGALRM may
come unexpectedly.
3. Call SetLatch in handle_sig_alarm only if some timeout is reached.

Solution 1 is a simple one, but it cannot guarantee that some other
functionality will not set a timeout that affects LockBufferForCleanup.
Solution 2 seems to be more robust, but it is harder to implement. Furthermore,
I cannot rule out that there are other places where the timeout functionality
is used in a wrong way. Solution 3 seems to be the simplest, but there is an
opinion that any SIGALRM should wake up the process (set the latch).
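
As a rough illustration of how solutions 1 and 3 change the outcome, here is a
toy simulation in Python. It is a simplified model of the scenario in this
message, not the attached patch. The function simulate_with_fix and the fix
names "always_setitimer" and "latch_on_expiry" are hypothetical labels for
solutions 1 and 3 respectively.

```python
def simulate_with_fix(progress_ms, deadlock_ms, fix, max_retries=50):
    """Return the simulated time (ms) at which the deadlock timeout fires,
    or None if it never fires within max_retries retries.

    fix: "none", "always_setitimer" (solution 1) or "latch_on_expiry"
    (solution 3).
    """
    now = 0
    timer_due = progress_ms  # stale timer left behind by the progress timeout
    for _ in range(max_retries):
        deadlock_fin = now + deadlock_ms  # deadlock timeout (re)enabled
        if fix == "always_setitimer" or deadlock_fin < timer_due:
            timer_due = deadlock_fin
        while True:
            now = timer_due  # SIGALRM arrives
            if deadlock_fin <= now:
                return now  # timeout reached: the backend would be signalled
            timer_due = deadlock_fin  # handler re-arms the timer
            if fix != "latch_on_expiry":
                break  # latch set anyway: the process wakes up and retries
            # With solution 3 the process keeps sleeping until a timeout expires.
    return None


for fix in ("none", "always_setitimer", "latch_on_expiry"):
    print(fix, simulate_with_fix(1000, 3000, fix))
```

In this model both fixes let the 3000 ms deadlock timeout fire at t=3000. The
model contains only one other timeout, so it cannot show the concern raised
above that solution 1 does not protect against timeouts set by unrelated code.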

Any ideas?

[1] 900_startup_backend_deadlock.pl
[2] 0001-Fix-deadlock-detector-activation-in-startup-process.patch

Attachment Content-Type Size
900_startup_backend_deadlock.pl application/x-perl 4.4 KB
0001-Fix-deadlock-detector-activation-in-startup-process.patch text/x-patch 877 bytes
