| From: | Melanie Plageman <melanieplageman(at)gmail(dot)com> |
|---|---|
| To: | Xuneng Zhou <xunengzhou(at)gmail(dot)com> |
| Cc: | Alexander Lakhin <exclusion(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de> |
| Subject: | Re: 048_vacuum_horizon_floor.pl hangs due to wakeup lost inside LockBufferForCleanup |
| Date: | 2026-06-22 16:43:29 |
| Message-ID: | CAAKRu_b6C_VYoopvDxKogMM148o1E7xQSte1rSu6v58RbhzedA@mail.gmail.com |
| Lists: | pgsql-hackers |
On Mon, Jun 22, 2026 at 4:14 AM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
>
> The direct regression appears to be 5310fac6e0f. It allows this interleaving:
>
> W: LockBufferForCleanup() holds buffer header lock
> W: observes refcount > 1
> P: releases the last competing pin with atomic fetch_sub
> P: old state does not contain BM_PIN_COUNT_WAITER, so no wakeup
> W: publishes BM_PIN_COUNT_WAITER
> W: sleeps in ProcWaitForSignal()
>
> At this point the condition W wanted is already true: refcount is 1,
> meaning only W's own pin remains. So W could sleep indefinitely, because
> no future unpin will arrive to wake it.
>
> We can fix this by using the state returned by UnlockBufHdrExt() when
> publishing BM_PIN_COUNT_WAITER. If the refcount in that state is 1, do not
> enter the wait path. Instead, fall through to the existing waiter-bit
> cleanup and retry the loop to acquire the cleanup lock normally. The
> reproducer test passed after applying the patch.
Thanks for investigating!
Does the reproducer pass prior to 5310fac6e0f?
- Melanie
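
For readers following the quoted interleaving, here is a minimal standalone sketch of the race and of the proposed check, written with C11 atomics. It is not PostgreSQL code: ModelBuffer, the bit layout, and the model_* functions are hypothetical stand-ins for the buffer state word, BM_PIN_COUNT_WAITER, the unpin path, and the step that publishes the waiter bit. The real wakeup condition in bufmgr.c may differ in detail.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout for this model only: low 18 bits = pin count, next bit = waiter flag. */
#define REFCOUNT_MASK    ((uint32_t) 0x3FFFF)
#define PIN_COUNT_WAITER ((uint32_t) 1 << 18)

typedef struct
{
    _Atomic uint32_t state;
} ModelBuffer;

static uint32_t
refcount_of(uint32_t state)
{
    return state & REFCOUNT_MASK;
}

/*
 * Pinner side (P): drop one pin with an atomic fetch_sub.
 * Returns true if P would wake the waiter.  P only looks at the OLD state,
 * so if the waiter flag was not yet published, no wakeup is sent.
 */
static bool
model_unpin(ModelBuffer *buf)
{
    uint32_t old = atomic_fetch_sub(&buf->state, 1);

    return (old & PIN_COUNT_WAITER) != 0 && refcount_of(old) == 2;
}

/*
 * Waiter side (W): publish the waiter flag and inspect the state that was
 * actually published.  Returns true if W should sleep.  If only W's own pin
 * remains, the condition W wants already holds, so W must not sleep.
 */
static bool
model_publish_waiter(ModelBuffer *buf)
{
    uint32_t old = atomic_fetch_or(&buf->state, PIN_COUNT_WAITER);
    uint32_t published = old | PIN_COUNT_WAITER;

    assert(refcount_of(published) >= 1);    /* W holds its own pin */
    return refcount_of(published) > 1;
}

/* Waiter-bit cleanup before W retries the loop. */
static void
model_clear_waiter(ModelBuffer *buf)
{
    atomic_fetch_and(&buf->state, ~PIN_COUNT_WAITER);
}
```

In the problematic ordering, model_unpin() runs first and reports no wakeup, then model_publish_waiter() sees a refcount of 1 and tells W not to sleep, which is the behaviour the proposed fix aims for. Without that check, W would go to sleep with nobody left to wake it.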