Quick Links

Re: Kernel AIO on FreeBSD, macOS and a couple of other Unixen

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Kernel AIO on FreeBSD, macOS and a couple of other Unixen
Date:	2025-08-08 22:28:09
Message-ID:	ztbfgkssz2jyzokirkrcfwnnhgky3dxt67d6q3gj3szfu2w4xs@wlru2hnfg3ex
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On 2025-08-05 23:26:23 +1200, Thomas Munro wrote:
> Up against io_method=worker in simple tests, you can't win sometimes
> because of the help they supply with checksum validation when they run
> completion, and if the data is served straight out of kernel cache
> that might be a bigger factor, but that's not specific to this IO
> method.

Right. I think that's much less of an issue in realistic workloads, where one
rarely is bound by memory bandwidth. It sure is easy to hit on intel with
things like pg_prewarm() though :(

Does the new IO method *ever* win?

> It also compiles and passes tests on Linux with glibc's user space
> thread-based POSIX AIO implementation, which could be convenient for
> developers, but it's emphatically not a target, it's utterly terrible.

I actually wonder if we shouldn't explicitly disable support for that. It
seems to be so broken that I'm worried about spending effort debugging it.

> Another problem was the lack of a completion queue on macOS. I think
> the history is that they ported kqueue from FreeBSD but removed
> EVFILT_AIO because they didn't have AIO yet, and then a couple of
> years later introduced AIO but never connected them together.
> Recently I hit on a simple new idea that doesn't require massive
> amounts of IO polling to recreate a completion queue, and it seems to
> have legs.

I think we should explicitly not support macOS. Posix AIO is terribly slow on
macos and requires tuning of sysctls to be usable at all.

> Some changes and problems that came up along the way as I pulled this
> old AIOv1 code apart and put it back together again for AIOv2:
>
> * I adopted the same general queue/completion_lock-per-backend design
> as io_uring adopted in AIOv2; I like it much better this way and it
> has a nice just-delete-lots-of-lines pathway to a multi-threaded mode

> * I don't think it's right for this IO method to PANIC if aio_read()
> etc fail with EAGAIN: it makes some sense for io_method=io_uring
> because it sizes its own submission queue perfectly, but with this
> API, equivalent resources are configured elsewhere eg sysctl -a | grep
> aio. So I think it should follow io_method=worker instead and fall
> back to synchronous execution if there's an unknowable system limit in
> the way (in practice this seems to be easiest to hit on a Mac with
> default sysctls)

> * To do that, I had to deviate slightly from the contract (and name)
> of pgaio_prepare_submit(), and call it *after* submitting, but if
> submission fails with EAGAIN, there was no supported way to set the
> PGAIO_HS_SYNCHRONOUS flag while already in PGAIO_HS_STAGED state
> before going to PGAIO_HS_SUBMITTED, without which a backend might
> concurrently reach wait_one() and hang, so I invented
> pgaio_io_prepare_submit_synchronously() that just sets that flag for
> you first

I'm not following why this is needed - after all worker mode doesn't need it?

I don't think you need to set PGAIO_HS_SYNCHRONOUS - that's a hint by the AIO
user to tell the AIO subsystem that the caller will immediately wait, I don't
think we should set it when we have to fall back to synchronous execution.

I think *if* we want to this, we should use a distinct flag, imo it'd be
confusing if falling-back-to-synchronous-IO looks the same as
inherently-synchronous-IO (i.e. we'll immediately block waiting for the IO).

> * In that new function, I discovered that if I didn't insert
> pg_read_barrier() before ioh->flags |= PGAIO_HS_SYNCHRONOUS, then the
> flag could occasionally be lost on my Mac laptop, which would normally
> imply a memory model screwup, except that I don't expect ioh->flags to
> be written to by any other backend, not in a previous generation, not
> ever after initialisation, which made me begin to wonder if I might be
> seeing something related to what Alexander and Konstantin reported,
> despite knowing that when you start to suspect the compiler or CPU is
> wrong it's usually just time to forget about computers and go for a
> walk, but what I am missing? The test workload was deliberately
> generating a lot of concurrent scans of the same table to exercise the
> cross-proces stuff, which means lots of asynchronous signals being
> handled while submission is underway...

Uh, huh. That's very very interesting. Do you have a recipe for how to
reproduce this? How long does it take you to reproduce?

Does the issue remain reproducibel if you change the bitfields in PgAioHandle
to uint8s? The code should just compile after that, except not getting
warnings in switch()es anymore.

> From 6574ac9267fe9938f59ed67c8f0282716d8c28f3 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
> Date: Sun, 3 Aug 2025 00:15:01 +1200
> Subject: [PATCH v1 3/4] aio: Support I/O methods without true vectored I/O.

What OSs do we need this for? IIUC we only need this for MacOS for this
patchset? And then probably for windows if we introduce native windows AIO?

I hit a related issue when trying out using "fixed buffers" with
io_uring. Fixed buffers allow io_uring to avoid pinning/unpinning buffers when
starting/completing IOs, which turns out to be significant scalability boost
when using huge pages (a lot of postgres pages get mapped onto a single OS
level page, all using the same refcount, which causes contention). However
io_uring fixed buffers only support addressing 4GB of memory within one IO,
which means that we need to avoid combining IO if the buffers are further
apart than that. The reason that that is interesting is that that means we'd
need to pass in the buffers for the IO method to be able to determine whether
the IO can be executed as a single IO.

> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index 67431208e7f..386aa658f8d 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -1268,6 +1268,7 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
> {
> int actual_nblocks = *nblocks;
> int maxcombine = 0;
> + bool vectorcombine = false;
> bool did_start_io;
>
> Assert(*nblocks == 1 || allow_forwarding);
> @@ -1373,6 +1374,7 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
> */
> if (i == 0 && actual_nblocks > 1)
> {
> + vectorcombine = smgrvectorcombine(operation->smgr);
> maxcombine = smgrmaxcombine(operation->smgr,
> operation->forknum,
> blockNum);
> @@ -1383,6 +1385,20 @@ StartReadBuffersImpl(ReadBuffersOperation *operation,
> actual_nblocks = maxcombine;
> }
> }
> +
> + /*
> + * If true vectored I/O is not available, check if we've cross
> + * into a non-consecutive buffer and need to rewind, to avoid
> + * forming a read that can't be executed efficiently. If so, the
> + * discontiguous pinned buffer is forwarded to the next call.
> + */
> + if (i > 0 && !vectorcombine)
> + {
> + if (BufferIsLocal(buffers[i]) ?
> + -buffers[i] != -buffers[i - 1] + 1 :
> + buffers[i] != buffers[i - 1] + 1)
> + actual_nblocks = i;
> + }
> }
> }
> *nblocks = actual_nblocks;

I think it'd be nice if we could combine smgrvectorcombine() and
smgrmaxcombine() into one. Two smgr dispatches for every buffer seems a bit
absurd.

> +/* Platform quirks and characteristics. */
> +#if defined(__APPLE__) || defined(__FreeBSD__) || defined(__NetBSD__)
> +#define PGAIO_POSIX_AIO_CLOSE_HARMLESS
> +#endif
> +#ifdef __APPLE__
> +#define PGAIO_POSIX_AIO_LIO_NOTIFICATIONS_BROKEN
> +#endif
> +#ifdef __GLIBC__
> +#define PGAIO_POSIX_AIO_ASYNC_SIGNAL_SAFETY_BROKEN
> +#endif

I'm really really really doubtful it's a good idea to support linux and
macos. I wasted so much time due to them with AIOv1 and I see zero practical
use of posix_aio on those platforms.

> +/* Per-backend state. */
> +typedef struct pg_attribute_aligned (PG_CACHE_LINE_SIZE)
> +PgAioPosixAioContext
> +{
> + /* Serialize completion processing for each backend. */
> + LWLock completion_lock;
> +
> + /* Cross-process completion support. */
> + pg_atomic_uint32 ipc_procno;;

superfluous semicolon.

> +static void
> +pgaio_posix_aio_init_backend(void)
> +{
> + pgaio_my_posix_aio_context = &pgaio_posix_aio_contexts[MyProcNumber];
> +
> + /* AIO control blocks for this backend's IOs. */
> + pgaio_posix_aio_naiocbs = dclist_count(&pgaio_my_backend->idle_ios);

Isn't that just an alias for io_max_concurrency?

> + pgaio_posix_aio_aiocbs = palloc(sizeof(struct aiocb) *
> + pgaio_posix_aio_naiocbs);
> +

I assume it's guaranteed that we are in a sufficiently long-lived memory
context?

> +static PgAioHandle *
> +pgaio_posix_aio_get_io_by_id(int32 id)
> +{
> + return &pgaio_ctl->io_handles[id];
> +}

A bit doubtful this is the right thing to have in posix AIO specific code,
but...

> +static void
> +pgaio_posix_aio_prepare_aiocb(PgAioHandle *ioh, struct aiocb *aiocb)
> +{
> + struct iovec *iov;
> +
> + memset(aiocb, 0, sizeof(*aiocb));
> +
> + switch (ioh->op)
> + {
> + case PGAIO_OP_READV:
> + iov = &pgaio_ctl->iovecs[ioh->iovec_off];
> +#ifndef PGAIO_POSIX_AIO_HAVE_VECTORED_OPS
> + /* Force short read, first vector only. */
> + ioh->op_data.read.iov_length = 1;
> +#endif

Is this actually reachable code? I thought the smgrvectorcombine() should
prevent this from being needed? Maybe we could make it part of the AIO API
contract that it's only permissible to do vectored IO if supported?

> +static inline void
> +pgaio_posix_aio_submitted(PgAioHandle *ioh)
> +{
> +#ifdef PGAIO_POSIX_AIO_USE_MERGED_COMPLETION_SIGNAL
> + /*
> + * When using a merged signal, store the special not-yet-drained value
> + * after submitting but before advancing to PGAIO_HS_SUBMITTED. Otherwise
> + * pgaio_posix_aio_ipc_handler() could reach aio_error() before the aiocb
> + * is known to the OS.
> + */
> + ioh->result = -EINPROGRESS;
> +#endif
> +
> + /*
> + * Advance to PGAIO_HS_SUBMITTED state only after successful submission.
> + * If we advanced it sooner, wait_one() could be reached for an IO that is
> + * falling back to synchronous execution and hang.
> + */
> + pgaio_io_prepare_submit(ioh);
> +}

> +/*
> + * Supported completion queue APIs providing the functions
> + * pgaio_posix_aio_{init,prepare,drain}_completion_queue().
> + */
> +
> +#ifdef PGAIO_POSIX_AIO_USE_MERGED_COMPLETION_SIGNAL
> +/*
> + * POSIX intended realtime (unmerged, queued, payload-carrying) signals as the
> + * completion queues for AIO, and they can tell you which IO completed, but
> + * they are baroque and unportable: queues overflow, and macOS hasn't got them.
> + *
> + * Just use a plain old merged signal as a completion notification, and poll
> + * the inflight IO queue to unmerge it:
> + *
> + * * If using kevent(), EVFILT_SIGNAL has a merge count, so we can poll in
> + * submission order and terminate when we've found them all. This avoids
> + * almost all spurious polling as long as completion order roughly matches.
> + *
> + * * If using sigwait(), available on all POSIX systems, then we have to
> + * poll all inflight IOs. That's terrible, but compiles and runs on
> + * Linux/glibc using the same code paths, for basic testing only.
> + *
> + * FreeBSD needs none of this and never polls. It receives EVFILT_AIO
> + * events containing the IO and error status. Hopefully macOS and NetBSD will
> + * connect their AIO to their kqueue one day. (Commercial Unixen all had their
> + * own kqueue-like thing for this, to review if AIX support is ever wanted.)
> + */
> +
> +#if defined(SIGPOLL)
> +#define PGAIO_POSIX_AIO_COMPLETION_SIGNO SIGPOLL
> +#elif defined(SIGIO)
> +#define PGAIO_POSIX_AIO_COMPLETION_SIGNO SIGIO
> +#else
> +#error "can't find a signal to use for I/O completion notification from kernel"
> +#endif

All this portability stuff here scares me. I think we should just reduce this
to freebsd (and perhaps netbsd). We can add support for other OSs if they have
a reasonably performing posix AIO implementation.

> +/*
> + * Acquire this backend's own completion_lock, and if some other backend holds
> + * it, also command pgaio_posix_aio_ipc_drain_completion_queue() to give up
> + * early since this backend can process its own queue promptly and efficiently.
> + */
> +static void
> +pgaio_posix_aio_ipc_acquire_own_completion_lock(PgAioPosixAioContext *context)
> +{
> + Assert(context == pgaio_my_posix_aio_context);
> + Assert(!LWLockHeldByMe(&context->completion_lock));
> +
> + if (!LWLockConditionalAcquire(&context->completion_lock, LW_EXCLUSIVE))
> + {
> + ProcNumber procno;
> +
> + procno = pg_atomic_exchange_u32(&context->ipc_procno, MyProcNumber);
> + if (procno != INVALID_PROC_NUMBER)
> + SetLatch(&GetPGProcByNumber(procno)->procLatch);
> +
> + LWLockAcquire(&context->completion_lock, LW_EXCLUSIVE);
> + pg_atomic_write_u32(&context->ipc_procno, INVALID_PROC_NUMBER);
> + }
> +}

Is the "command pgaio_posix_aio_ipc_drain_completion_queue() to give up" path
frequent enough to be worth the complexity? I somewhat doubt so?

Greetings,

Andres Freund

In response to

Kernel AIO on FreeBSD, macOS and a couple of other Unixen at 2025-08-05 11:26:23 from Thomas Munro

Responses

Re: Kernel AIO on FreeBSD, macOS and a couple of other Unixen at 2025-08-08 22:34:33 from Andres Freund

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andres Freund	2025-08-08 22:34:33	Re: Kernel AIO on FreeBSD, macOS and a couple of other Unixen
Previous Message	Tom Lane	2025-08-08 22:27:44	Re: dikkop seems unhappy because of openssl stuff (FreeBSD 14-BETA1)