| From: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
|---|---|
| To: | Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru> |
| Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | Re: Deadlock detector fails to activate on a hot standby replica |
| Date: | 2026-06-10 08:42:19 |
| Message-ID: | CAHGQGwFED52rRVb_a=bVSdbOA=oARVOUyw381GC_6uKoNWJAtQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Wed, Jun 10, 2026 at 1:18 AM Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru> wrote:
>
> On 6/5/26 14:58, Fujii Masao wrote:
> > I've updated the patch based on these comments.
> > Attached is the latest version.
> >
> > I removed the TAP test from this patch for now. I'll consider adding
> > a test for this separately later.
>
> Thank you again for your assistance. I agree with the code changes you've
> proposed. I've updated 031_recovery_conflict.pl to include the deadlock
> test, instead of creating a separate TAP test. The attached patch contains
> these changes in its second commit. I've kept the patch version as v3 since
> the original fix remains unchanged; this update only adds new tests.
Thanks for updating the regression test patch!
I reviewed both patches and found a few issues, so I fixed them and
attached updated v4 versions.
For v3-0001, when the startup process wakes up from
ProcWaitForSignal() and finds the pin count is still not 1, it waits
again. However, before waiting again, it must set
BM_PIN_COUNT_WAITER again. Otherwise, a backend releasing the pin
cannot signal the startup process because the flag was already cleared.
For example:
1. Backend A releases the pin, signals the startup process, and clears
BM_PIN_COUNT_WAITER.
2. The startup process wakes up from ProcWaitForSignal().
3. Before it rechecks the pin count, backend B acquires a pin.
4. The startup process sees the pin count is still 2.
5. The startup process waits again.
At step 5, BM_PIN_COUNT_WAITER is already cleared, so backend B will
not signal the startup process when releasing the pin.
This race is unlikely in practice, but it is still possible. So in
v4-0001, I fixed this by re-registering the startup process as the
waiter before waiting again.
For v3-0002, the two deadlock tests contained very similar scenarios.
I think it's simpler to move the common logic into a shared subroutine.
So in v4-0002, I added a helper routine for testing the deadlock
between the startup process and a backend, and both tests now use it.
Could you review the v4 patches?
Regards,
--
Fujii Masao
| Attachment | Content-Type | Size |
|---|---|---|
| v4-0002-Add-new-deadlock-conflict-test-in-031_recovery_co.patch | application/octet-stream | 7.8 KB |
| v4-0001-Fix-deadlock-detector-activation-in-a-recovery-co.patch | application/octet-stream | 8.5 KB |
| From | Date | Subject | |
|---|---|---|---|
| Previous Message | Xuneng Zhou | 2026-06-10 08:36:14 | Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline |