Re: Deadlock detector fails to activate on a hot standby replica

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Deadlock detector fails to activate on a hot standby replica
Date: 2026-06-10 08:42:19
Message-ID: CAHGQGwFED52rRVb_a=bVSdbOA=oARVOUyw381GC_6uKoNWJAtQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jun 10, 2026 at 1:18 AM Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru> wrote:
>
> On 6/5/26 14:58, Fujii Masao wrote:
> > I've updated the patch based on these comments.
> > Attached is the latest version.
> >
> > I removed the TAP test from this patch for now. I'll consider adding
> > a test for this separately later.
>
> Thank you again for your assistance. I agree with the code changes you've
> proposed. I've updated 031_recovery_conflict.pl to include the deadlock
> test, instead of creating a separate TAP test. The attached patch contains
> these changes in its second commit. I've kept the patch version as v3 since
> the original fix remains unchanged; this update only adds new tests.

Thanks for updating the regression test patch!

I reviewed both patches and found a few issues, so I fixed them and
attached updated v4 versions.

For v3-0001, when the startup process wakes up from
ProcWaitForSignal() and finds the pin count is still not 1, it waits
again. However, before waiting again, it must set
BM_PIN_COUNT_WAITER again. Otherwise, a backend releasing the pin
cannot signal the startup process because the flag was already cleared.

For example:
1. Backend A releases the pin, signals the startup process, and clears
BM_PIN_COUNT_WAITER.
2. The startup process wakes up from ProcWaitForSignal().
3. Before it rechecks the pin count, backend B acquires a pin.
4. The startup process sees the pin count is still 2.
5. The startup process waits again.

At step 5, BM_PIN_COUNT_WAITER is already cleared, so backend B will
not signal the startup process when releasing the pin.

This race is unlikely in practice, but it is still possible. So in
v4-0001, I fixed this by re-registering the startup process as the
waiter before waiting again.

For v3-0002, the two deadlock tests contained very similar scenarios.
I think it's simpler to move the common logic into a shared subroutine.
So in v4-0002, I added a helper routine for testing the deadlock
between the startup process and a backend, and both tests now use it.

Could you review the v4 patches?

Regards,

--
Fujii Masao

Attachment Content-Type Size
v4-0002-Add-new-deadlock-conflict-test-in-031_recovery_co.patch application/octet-stream 7.8 KB
v4-0001-Fix-deadlock-detector-activation-in-a-recovery-co.patch application/octet-stream 8.5 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Previous Message Xuneng Zhou 2026-06-10 08:36:14 Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline