Re: Potential deadlock in pgaio_io_wait()

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Potential deadlock in pgaio_io_wait()
Date: 2025-08-19 00:07:23
Message-ID: kl7wzxh6lhtbexnkp7meobqjksnayvzvh53uzsdjsubdfkom37@kemmr6est2g6
Lists: pgsql-bugs pgsql-hackers

Hi,

On 2025-08-15 17:39:30 +1200, Thomas Munro wrote:
> I discussed this off-list with Andres who provided the following review:
>
> * +1 for the analysis
> * +1 for the solution
> * his benchmark machine shows no regression under heavy IO submission workload
> * needs better comments
>
> I had expected that patch to be rejected as too slow. I was thinking
> that it should be enough to insert a memory barrier and then do an
> unlocked check for an empty waitlist in ConditionVariableBroadcast()
> to avoid a spinlock acquire/release. That caused a few weird hangs I
> didn't understand, which is why I didn't post that version last time,
> but he pointed out that the other side also needs a memory barrier in
> ConditionVariablePrepareToSleep() and the existing spinlock
> acquire/release is not enough.

FTR, the reason we would need a barrier there is that we define
SpinLockRelease() as a non-atomic volatile store on x86. That means there is
no guarantee that another backend sees an up-to-date view of memory when it
reads a location *without* acquiring the spinlock. The reason this is correct
for spinlocks themselves is that x86 has strongly ordered stores, and the
*acquisition* of a spinlock can only succeed once the release of the spinlock
has become visible. But if you read something *without* taking the spinlock,
you're not guaranteed to see all the changes made while the spinlock was held.

I do wonder if we accidentally rely on SpinLockRelease() being a barrier
anywhere...
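
To make the pairing concrete, here's a rough sketch of what Thomas describes
(an illustration of the idea, not the committed code; the primitives are the
existing pg_memory_barrier(), proclist_* and spinlock APIs):

	void
	ConditionVariableBroadcast(ConditionVariable *cv)
	{
		/*
		 * Pair with the barrier in ConditionVariablePrepareToSleep().
		 * The caller stored the condition before calling us; x86 allows
		 * the unlocked load of cv->wakeup below to be reordered before
		 * that store, so without a full barrier we could see a stale,
		 * empty list and skip a waiter that is already on it.
		 */
		pg_memory_barrier();

		if (proclist_is_empty(&cv->wakeup))
			return;				/* fast path: no waiters, no spinlock */

		/* slow path as today: acquire cv->mutex and wake everyone */
	}

and on the sleeper side, in ConditionVariablePrepareToSleep():

	SpinLockAcquire(&cv->mutex);
	proclist_push_tail(&cv->wakeup, MyProcNumber, cvWaitLink);
	SpinLockRelease(&cv->mutex);

	/*
	 * SpinLockRelease() is just a plain store on x86, so it does not
	 * keep the caller's subsequent re-check of the condition (a load)
	 * from being reordered before our list insertion (a store).  An
	 * explicit full barrier restores the classic sleeper/waker protocol:
	 * [insert; barrier; read condition] vs. [write condition; barrier;
	 * read list].
	 */
	pg_memory_barrier();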

> I also thought of a small optimisation, presented in the -B patch.
> It's a bit of a shame to wait for backend->submit_cv and then also
> ioh->cv in io_method=worker. It's just a bunch of IPC ping-pong for
> nothing. So I figured it should be allowed to fall all the way
> through based on its lack of ->wait_one. Worth bothering with?

I don't think I'd go there without concrete evidence that it helps - it makes
it harder to understand when to wait for what. Not terribly so, but enough
that I'd want some measurable benefit first.

> On a superficial note:
>
> AIO_IO_COMPLETION "Waiting for another process to complete IO."
> +AIO_IO_SUBMIT "Waiting for another process to submit IO."
> AIO_IO_URING_SUBMIT "Waiting for IO submission via io_uring."
> AIO_IO_URING_EXECUTION "Waiting for IO execution via io_uring."

> We're inconsistent in our choice of noun or verb. I went with _SUBMIT
> to match the following line, rather than _SUBMISSION to match the
> preceding line. Shrug.

FWIW, I think so far it's a verb when the process is doing the work (e.g. the
process is calling io_uring_enter(to_submit = ...)). It's a noun if we wait
for somebody *else* to do the completion, since we're not doing the work
ourselves.

> From ec2e1e21054f00918d3b28ce01129bc06de37790 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
> Date: Sun, 3 Aug 2025 23:07:56 +1200
> Subject: [PATCH v2 1/2] aio: Fix pgaio_io_wait() for staged IOs.
>
> Previously, pgaio_io_wait()'s cases for PGAIO_HS_DEFINED and
> PGAIO_HS_STAGED fell through to waiting for completion. The owner only
> promises to advance it to PGAIO_HS_SUBMITTED. The waiter needs to be
> prepared to call ->wait_one() itself once the IO is submitted in order
> to guarantee progress and avoid deadlocks on IO methods that provide
> ->wait_one().
>
> Introduce a new per-backend condition variable submit_cv, woken by
> pgaio_submit_stage(), and use it to wait for the state to advance. The
> new broadcast doesn't seem to cause any measurable slowdown, so ideas
> for optimizing the common no-waiters case were abandoned for now.
>
> It may not be possible to reach any real deadlock with existing AIO
> users, but that situation could change. There's also no reason the
> waiter shouldn't begin to wait via the IO method as soon as possible
> even without a deadlock.
>
> Picked up while testing a proposed IO method that has ->wait_one(), like
> io_method=io_uring, and by code review.

LGTM.
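
For the archives, the control flow this establishes in pgaio_io_wait() is
roughly the following (a sketch based on the commit message, not the patch
itself; submit_cv and the AIO_IO_SUBMIT wait event come from the patch, and
"state"/"owner" stand in for however the waiter re-reads the handle's state
and finds the owning backend):

	for (;;)
	{
		switch (state)
		{
			case PGAIO_HS_DEFINED:
			case PGAIO_HS_STAGED:
				/*
				 * The owner only promises to advance the IO to
				 * PGAIO_HS_SUBMITTED, so don't fall through to the
				 * completion wait anymore; wait on the owner's
				 * submit_cv instead.
				 */
				ConditionVariableSleep(&owner->submit_cv,
									   WAIT_EVENT_AIO_IO_SUBMIT);
				break;

			case PGAIO_HS_SUBMITTED:
				/*
				 * Drive the IO method's own wait if it has one (e.g.
				 * io_uring); that's what guarantees progress and
				 * avoids the deadlock.
				 */
				if (pgaio_method_ops->wait_one)
				{
					pgaio_method_ops->wait_one(ioh, ref_generation);
					break;
				}
				/* no ->wait_one: wait for completion as before */
				ConditionVariableSleep(&ioh->cv,
									   WAIT_EVENT_AIO_IO_COMPLETION);
				break;

			default:
				return;			/* IO has completed */
		}
		/* loop: re-read the state and re-check */
	}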

Greetings,

Andres Freund
