Re: 048_vacuum_horizon_floor.pl hangs due to wakeup lost inside LockBufferForCleanup

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Xuneng Zhou <xunengzhou(at)gmail(dot)com>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: 048_vacuum_horizon_floor.pl hangs due to wakeup lost inside LockBufferForCleanup
Date: 2026-06-22 16:43:29
Message-ID: CAAKRu_b6C_VYoopvDxKogMM148o1E7xQSte1rSu6v58RbhzedA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jun 22, 2026 at 4:14 AM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
>
> The direct regression appears to be 5310fac6e0f. It allows this interleaving:
>
> W: LockBufferForCleanup() holds buffer header lock
> W: observes refcount > 1
> P: releases the last competing pin with atomic fetch_sub
> P: old state does not contain BM_PIN_COUNT_WAITER, so no wakeup
> W: publishes BM_PIN_COUNT_WAITER
> W: sleeps in ProcWaitForSignal()
>
> At this point the condition W wanted is already true: refcount is 1,
> meaning only W's own pin remains. So W could sleep indefinitely, as no
> future unpin will arrive to wake it.
>
> We can fix this by checking the state returned by UnlockBufHdrExt() when
> publishing BM_PIN_COUNT_WAITER. If the refcount in that state is 1, do not
> enter the wait path. Instead, fall through to the existing waiter-bit
> cleanup and retry the loop to acquire the cleanup lock normally. The
> reproducer test passed after applying the patch.
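
For readers following along, the interleaving and the proposed re-check can be modeled in a few lines of C11 atomics. This is only a rough sketch and not the bufmgr.c code: the bit layout, `buf_state`, `unpin()`, `publish_waiter()`, `waiter_would_sleep()` and `simulate()` are all hypothetical names made up for illustration, the buffer header lock is left out, and the steps are replayed sequentially on one thread in the order listed above.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: low 16 bits = refcount, one higher bit = waiter flag. */
#define REFCOUNT_MASK   0x0000FFFFu
#define PIN_WAITER_BIT  0x00010000u

static _Atomic uint32_t buf_state;

/*
 * P: release one pin. The wakeup decision is made from the OLD state
 * returned by the fetch_sub, as in the interleaving described above.
 * Returns true if P would send a wakeup.
 */
static bool
unpin(void)
{
    uint32_t old = atomic_fetch_sub(&buf_state, 1);

    return (old & PIN_WAITER_BIT) != 0 && (old & REFCOUNT_MASK) == 2;
}

/* W: set the waiter bit and return the state as of that publication. */
static uint32_t
publish_waiter(void)
{
    return atomic_fetch_or(&buf_state, PIN_WAITER_BIT) | PIN_WAITER_BIT;
}

/*
 * W: decide whether to go to sleep. With recheck = true (the assumed shape
 * of the fix), W inspects the refcount in the published state and, if only
 * its own pin remains, clears the waiter bit and retries instead of sleeping.
 */
static bool
waiter_would_sleep(bool recheck)
{
    uint32_t published = publish_waiter();

    if (recheck && (published & REFCOUNT_MASK) == 1)
    {
        atomic_fetch_and(&buf_state, ~PIN_WAITER_BIT);
        return false;           /* retry the loop, do not sleep */
    }
    return true;                /* enter the wait path */
}

/*
 * Replay the problematic order: W has seen refcount 2, then P unpins before
 * W publishes the waiter bit. Returns whether W ends up sleeping and reports
 * through *wakeup_sent whether P would have sent a wakeup.
 */
static bool
simulate(bool recheck, bool *wakeup_sent)
{
    atomic_store(&buf_state, 2);    /* W's pin plus P's pin */
    *wakeup_sent = unpin();         /* P: waiter bit not set yet */
    return waiter_would_sleep(recheck);
}
```

In this model, `simulate(false, &woke)` ends with W sleeping and no wakeup sent, which is the hang; `simulate(true, &woke)` ends with W not sleeping and the waiter bit cleared.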

Thanks for investigating!
Does the reproducer pass prior to 5310fac6e0f?

- Melanie
