Re: [PATCH] Prevent repeated deadlock-check signals in standby buffer pin waits

From: Henson Choi <assam258(at)gmail(dot)com>
To: JoongHyuk Shin <sjh910805(at)gmail(dot)com>
Cc: Xuneng Zhou <xunengzhou(at)gmail(dot)com>, Ilmar Yunusov <tanswis42(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [PATCH] Prevent repeated deadlock-check signals in standby buffer pin waits
Date: 2026-06-26 17:11:12
Message-ID: CAAAe_zCSPTg3bY2Ex=r9BscOJzoMqvKN3Yqp3xTKAB=X82hboA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

I'm fairly new to this part of the tree, so before reviewing the patch
itself
I spent some time on the surrounding code to make sure I actually understand
the setting. Let me lay out that understanding first -- I'd appreciate
corrections where I have it wrong.

The question I kept coming back to is: why is the hot-standby case the only
one
that needs all this machinery, when the primary takes a single, untimed
ProcWaitForSignal(), and a cold standby never runs into the contention at
all?

My understanding of the chain:

- A cleanup lock is stronger than an exclusive content lock: it also
requires
sole pinning (refcount == 1). Its purpose is to let prune/VACUUM
physically
remove tuples and compact the page, so there must be no other backend
still
referencing a tuple in that page by pointer -- and a pin is how that
presence
is detected.

- On the primary, the waiter is just VACUUM/prune, and a stall delays only
that
one operation -- the rest of the system runs on -- so waiting indefinitely
(the untimed ProcWaitForSignal) is harmless.

- A cold standby (hot_standby = off) has no read-only queries, hence no
query
backend pinning the page to wait for, so it never even waits -- simpler
still
than the primary.

The special case is the hot standby.

- On a standby, VACUUM (and prune) is also carried out by WAL replay -- the
cleanup that happened on the primary arrives as a WAL record, which the
startup process replays under the cleanup lock.

- But a hot standby is also serving read-only queries, whose pins -- e.g. a
cursor holding a page across a client round-trip -- conflict with that
cleanup lock.

- Replay is a single serialized stream, so one stalled record stalls every
WAL
record behind it: the conflict is no longer isolated, it becomes
replication
lag for the whole standby.

- So the replay side cannot just wait -- it arms STANDBY_DEADLOCK_TIMEOUT
and
STANDBY_TIMEOUT, and ultimately cancels the pin holder at
max_standby_*_delay.

So the complexity is specific to doing two things at once on one node:
serving
queries and replaying a strictly ordered stream. The primary avoids it
because
its waits are isolated; the cold standby avoids it because it has no
queries.
The hot standby has neither escape, and this path is where that tension is
sharpest -- which is why a small change to it seems to warrant more careful
review than most.

Does the above match how others see it? Assuming it does, here is what I
found
tracing the patch itself.

The essence of this change is that, after the deadlock timeout fires,
instead of
returning to the caller the function re-waits toward the delay timeout
(STANDBY_TIMEOUT) inside itself -- the new second ProcWaitForSignal() in
ResolveRecoveryConflictWithBufferPin(). The code itself looks correct to
me.

It is also worth checking whether the two timeouts (deadlock / delay) can
fire
in the same SIGALRM and set both got_standby_delay_timeout and
got_standby_deadlock_timeout at once. And even if they cannot, whether the
code
should defensively account for that possibility -- rather than relying
implicitly
on the if/else ordering -- is worth considering.

And since this behavior depends on the relationship between
deadlock_timeout and
max_standby_streaming_delay (whether the deadlock branch is reached at all,
and
whether the second wait still has a timer), a test combining those
relationships
seems worthwhile. Mapping the four regimes against 054:

max_standby_streaming_delay | deadlock | how it ends |
patched | test
| branch | |
code |

--------------------------------+-------------+-----------------------+---------+----------
> deadlock_timeout | reached | cancel at delay |
yes | yes (031,054)
-1 (wait forever, ltime == 0) | reached | UnpinBuffer only, |
yes | none
| | no cancel -> infinite |
|
0 (immediate) | not reached | fast-path cancel at |
no | none
| | entry |
|
0 < delay < deadlock_timeout | not reached | cancel before the |
no | none
| | deadlock branch |
|

So the delay > deadlock_timeout regime is already well covered -- 031
exercises
both the buffer-pin conflict and the buffer-pin deadlock (via the
backend-side
check), and 054 adds the log timing -- though all with a finite delay. The
other three regimes are untested, and -1 in particular is the patch's core
path:
the only mode where the second wait has no startup-side timer and relies
solely
on the backend-side deadlock check plus UnpinBuffer.

I realize asking for that extra coverage may be a lot. But the -1 case in
particular has no startup-side timer to fall back on, so it seemed worth
widening coverage a little.

Regards,
Henson

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2026-06-26 18:06:36 Re: enhance wraparound warnings
Previous Message Peter Geoghegan 2026-06-26 15:49:48 Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions