| From: | Xuneng Zhou <xunengzhou(at)gmail(dot)com> |
|---|---|
| To: | assam258(at)gmail(dot)com |
| Cc: | JoongHyuk Shin <sjh910805(at)gmail(dot)com>, Ilmar Yunusov <tanswis42(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | Re: [PATCH] Prevent repeated deadlock-check signals in standby buffer pin waits |
| Date: | 2026-06-28 02:10:11 |
| Message-ID: | CABPTF7Wr7HY--1Foj3qmx+D1=pJGZPXqOdvHseHDzY2odYVRRQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi Henson,
Thanks for your review and showing the "chain of thought"!
On Sat, Jun 27, 2026 at 1:11 AM Henson Choi <assam258(at)gmail(dot)com> wrote:
>
> Hi hackers,
>
> I'm fairly new to this part of the tree, so before reviewing the patch itself
> I spent some time on the surrounding code to make sure I actually understand
> the setting. Let me lay out that understanding first -- I'd appreciate
> corrections where I have it wrong.
>
> The question I kept coming back to is: why is the hot-standby case the only one
> that needs all this machinery, when the primary takes a single, untimed
> ProcWaitForSignal(), and a cold standby never runs into the contention at all?
>
> My understanding of the chain:
>
> - A cleanup lock is stronger than an exclusive content lock: it also requires
> sole pinning (refcount == 1). Its purpose is to let prune/VACUUM physically
> remove tuples and compact the page, so there must be no other backend still
> referencing a tuple in that page by pointer -- and a pin is how that presence
> is detected.
I agree with your analysis here. The cleanup lock is the key to
enforce rule of 2
Buffer access rules:
2. Once one has determined that a tuple is interesting (visible to the
current transaction) one may drop the content lock, yet continue to access
the tuple's data for as long as one holds the buffer pin. This is what is
typically done by heap scans, since the tuple returned by heap_fetch
contains a pointer to tuple data in the shared buffer. Therefore the
tuple cannot go away while the pin is held (see rule #5). Its state could
change, but that is assumed not to matter after the initial determination
of visibility is made.
The pin itself is just a mechanism to ensure the stability of mapping
from page to buffer frame. It cannot guarantee the stability of tuples
on its own. To ensure scan can keep using a tuple pointer after
dropping the content lock(buffer access rule of 2), we need the HOT
prune / vacuum operations to adhere to the cleanup lock rule(sole
pinning): do not change the physical layout of the page unless nobody
other than us referencing it.
> - On the primary, the waiter is just VACUUM/prune, and a stall delays only that
> one operation -- the rest of the system runs on -- so waiting indefinitely
> (the untimed ProcWaitForSignal) is harmless.
I would not frame this as "harmless", the stall can still delay
VACUUM/pruning/index cleanup and contribute to bloat or local stalls
in primary.
> - A cold standby (hot_standby = off) has no read-only queries, hence no query
> backend pinning the page to wait for, so it never even waits -- simpler still
> than the primary.
I agree that a cold standby lacks the normal source of long-lived
conflicting pins. But lockBufferForCleanup() seems to still use the
simple non-hot-standby ProcWaitForSignal() path if some short-lived
internal pin exists.
> The special case is the hot standby.
>
> - On a standby, VACUUM (and prune) is also carried out by WAL replay -- the
> cleanup that happened on the primary arrives as a WAL record, which the
> startup process replays under the cleanup lock.
>
> - But a hot standby is also serving read-only queries, whose pins -- e.g. a
> cursor holding a page across a client round-trip -- conflict with that
> cleanup lock.
>
> - Replay is a single serialized stream, so one stalled record stalls every WAL
> record behind it: the conflict is no longer isolated, it becomes replication
> lag for the whole standby.
>
> - So the replay side cannot just wait -- it arms STANDBY_DEADLOCK_TIMEOUT and
> STANDBY_TIMEOUT, and ultimately cancels the pin holder at max_standby_*_delay.
>
> So the complexity is specific to doing two things at once on one node: serving
> queries and replaying a strictly ordered stream. The primary avoids it because
> its waits are isolated; the cold standby avoids it because it has no queries.
> The hot standby has neither escape, and this path is where that tension is
> sharpest -- which is why a small change to it seems to warrant more careful
> review than most.
>
> Does the above match how others see it? Assuming it does, here is what I found
> tracing the patch itself.
I think another important reason for this design of primary is: the
workload of primary is hybrid, mixed with write and read operations.
We cannot say this to a backend / application from system level: you
guys have been writing/reading too much for too long, so we decide to
kill you or let you decide your own fate right now just because what
you are doing here is in conflict with the maintenance operations
whose purpose are hopefully to let you continue your work in the
future. So we choose to let vaccum/prune to wait since it would harm
both availability of the system and the confidence of the user if it
is done otherwise. For hot standby, that's another story as you said.
> The essence of this change is that, after the deadlock timeout fires, instead of
> returning to the caller the function re-waits toward the delay timeout
> (STANDBY_TIMEOUT) inside itself -- the new second ProcWaitForSignal() in
> ResolveRecoveryConflictWithBufferPin(). The code itself looks correct to me.
>
> It is also worth checking whether the two timeouts (deadlock / delay) can fire
> in the same SIGALRM and set both got_standby_delay_timeout and
> got_standby_deadlock_timeout at once. And even if they cannot, whether the code
> should defensively account for that possibility -- rather than relying implicitly
> on the if/else ordering -- is worth considering.
>
> And since this behavior depends on the relationship between deadlock_timeout and
> max_standby_streaming_delay (whether the deadlock branch is reached at all, and
> whether the second wait still has a timer), a test combining those relationships
> seems worthwhile. Mapping the four regimes against 054:
>
> max_standby_streaming_delay | deadlock | how it ends | patched | test
> | branch | | code |
> --------------------------------+-------------+-----------------------+---------+----------
> > deadlock_timeout | reached | cancel at delay | yes | yes (031,054)
> -1 (wait forever, ltime == 0) | reached | UnpinBuffer only, | yes | none
> | | no cancel -> infinite | |
> 0 (immediate) | not reached | fast-path cancel at | no | none
> | | entry | |
> 0 < delay < deadlock_timeout | not reached | cancel before the | no | none
> | | deadlock branch | |
>
> So the delay > deadlock_timeout regime is already well covered -- 031 exercises
> both the buffer-pin conflict and the buffer-pin deadlock (via the backend-side
> check), and 054 adds the log timing -- though all with a finite delay. The
> other three regimes are untested, and -1 in particular is the patch's core path:
> the only mode where the second wait has no startup-side timer and relies solely
> on the backend-side deadlock check plus UnpinBuffer.
>
> I realize asking for that extra coverage may be a lot. But the -1 case in
> particular has no startup-side timer to fall back on, so it seemed worth
> widening coverage a little.
I haven't reviewed the patch yet.
--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Chengpeng Yan | 2026-06-28 05:13:47 | Re: Improve row estimation with multi-column unique indexes |
| Previous Message | Arne Roland | 2026-06-27 23:05:32 | Re: Key joins |