| From: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
|---|---|
| To: | Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru> |
| Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | Re: Deadlock detector fails to activate on a hot standby replica |
| Date: | 2026-06-12 09:23:01 |
| Message-ID: | CAHGQGwHh5j=sG3AZh=HyuXtPM6cmymp6MJkRAMJcJUyN2qpHrQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Fri, Jun 12, 2026 at 3:00 PM Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru> wrote:
>
> Dear Fujii Masao,
>
> On 6/10/26 11:42, Fujii Masao wrote:
> > Could you review the v4 patches?
>
> I've got the issue with BM_PIN_COUNT_WAITER. The origin of the issue is that
> the other backend resets the flag before sending the notification (signal)
> to the waiter. I agree with your changes where this flag is set again.
>
> I would like to propose BufferIsReadyForCleanup function name instead of
> BufferShouldWaitForCleanupLockForStandby (and change the returning value
> semantics). There are other possible alternatives: BufferIsReadyForWriting,
> BufferIsReadyForExclusiveAccess or something else. The code will look like
> below:
>
> if (BufferIsReadyForCleanup(bufid + 1))
> break;
Okay.
> I've also found strange email in the commit messages like below:
> Author: masao.fujii <masao(dot)fujii(at)masao(dot)fujii’s-MacBook-Pro>
That's my mistake. Sorry. I'll fix it in the next version of the patch.
> I tried the modified 031 tap test but I do not see that it reproduces the issue
> on the version without the fix. My version of 031 tap test hangs without the
> fix. I haven't yet reviewed the tap test code at the moment, but I'm going to do
> it as well.
I retried the modified 031 TAP test on my side without the fix, and
the test actually passed, so the test did not reliably reproduce the issue.
However, after increasing deadlock_timeout to 10s, the test hung
without the fix as expected, while it passed with the fix.
Based on this, I suspect the behavior depends on timing interactions
between deadlock_timeout and SIGALRM delivery caused by
log_startup_progress_interval. My test changes may simply have altered
the timing of those SIGALRM deliveries.
Also, the current test assumes that
log_startup_progress_interval is effective during standby WAL replay.
But as discussed in [1], that assumption may actually be wrong. It
seems to work today, but that could just be accidental and might
change in the future. So I think we should consider a testing approach
that does not depend on that assumption.
At the moment, the only alternative I can think of is to send SIGALRM
periodically to the startup process directly from the test, though I'm
not sure that's a good approach either...
> Below one more note that is not directly related to the fix but may probably
> improve the code readability.
I agree that some refactoring along those lines should be possible.
However, for backpatching to older branches, I'd prefer to avoid more
code churn than necessary for the bug fix itself. So I think it would
be better to commit the bug fix first, and then consider a refactoring
version separately for v20dev.
Regards,
--
Fujii Masao
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Antonin Houska | 2026-06-12 09:52:12 | Re: REPACK CONCURRENTLY fails on tables with generated columns |
| Previous Message | Ewan Young | 2026-06-12 08:39:54 | REPACK CONCURRENTLY fails on tables with generated columns |