Re: Blocking I/O, async I/O and io_uring

From: Craig Ringer <craig(dot)ringer(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robert(dot)haas(at)enterprisedb(dot)com>, Petr Jelinek <petr(dot)jelinek(at)enterprisedb(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Blocking I/O, async I/O and io_uring
Date: 2020-12-08 05:01:38
Message-ID: CAGRY4nx8hqNoUWpLHnE9FoUUWmegKT9pGiJyAb+hwn2iuYQSUw@mail.gmail.com

On Tue, 8 Dec 2020 at 12:02, Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> On 2020-12-08 10:55:37 +0800, Craig Ringer wrote:
> > A new kernel API called io_uring has recently come to my attention. I
> > assume some of you (Andres?) have been following it for a while.
>
> Yea, I've spent a *lot* of time working on AIO support, utilizing
> io_uring. Recently Thomas also joined in the fun. I've given two talks
> referencing it (last pgcon, last pgday brussels), but otherwise I've not
> yet written much about it. Things aren't *quite* right yet architecturally,
> but I think we're getting there.
>

That's wonderful. Thank you.

I'm badly behind on the conference circuit due to geographic isolation and
small children. I'll hunt up your talks.

> The current state is at https://github.com/anarazel/postgres/tree/aio
> (but it's not a very clean history at the moment).
>

Fantastic!

Have you done much bpf/systemtap/perf-based work on measurement and
tracing of latencies etc? If not, that's something I'd be keen to help with.
I've mostly been using systemtap so far but I'm trying to pivot over to
bpf.

I hope to submit a big tracepoints patch set for PostgreSQL soon to better
expose our wait points and latencies, improve visibility of blocking, and
help make activity traceable through all the stages of processing. I'll Cc
you when I do.
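
Roughly this shape, i.e. paired USDT probes around each wait point so
systemtap or bpftrace can histogram the latency between them. The probe
names here are purely illustrative, not what the patch will actually use:

    #include <sys/sdt.h>

    static void
    wait_point_example(int wait_event)
    {
        /* start/done pairs let systemtap or bpftrace measure the time
         * spent at each wait point, keyed by the wait_event argument;
         * names are hypothetical */
        DTRACE_PROBE1(postgres, wait_start, wait_event);

        /* ... block on the lock / latch / I/O here ... */

        DTRACE_PROBE1(postgres, wait_done, wait_event);
    }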

> > io_uring appears to offer a way to make system calls including reads,
> > writes, fsync()s, and more in a non-blocking, batched and pipelined
> > manner, with or without O_DIRECT. Basically async I/O with usable
> > buffered I/O and fsync support. It has ordering support, which is
> > really important for us.
>
> My results indicate that we really want to have, optional & not
> enabled by default of course, O_DIRECT support. We just can't benefit
> fully from modern SSDs otherwise. Buffered is also important, of course.
>

Even more so for NVDRAM, Optane and all that, where zero-copy and low
context-switch overhead become important too.

We're a long way from that being a priority but it's still not to be
dismissed.
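
Part of why O_DIRECT needs explicit support rather than being just
another open() flag is its alignment rules. A minimal illustration,
nothing postgres-specific:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* O_DIRECT demands that the buffer, file offset and length all be
     * aligned, typically to the logical block size; 4096 is a safe bet
     * on most devices. */
    int
    read_block_direct(const char *path, off_t off, size_t len, void **out)
    {
        void   *buf;
        int     fd = open(path, O_RDONLY | O_DIRECT);

        if (fd < 0)
            return -1;
        if (posix_memalign(&buf, 4096, len) != 0)
        {
            close(fd);
            return -1;
        }
        if (pread(fd, buf, len, off) < 0)
        {
            free(buf);
            close(fd);
            return -1;
        }
        close(fd);
        *out = buf;
        return 0;
    }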

> I'm pretty sure that I've got the basics of this working pretty well. I
> don't think the executor architecture is as big an issue as you seem to
> think. There are further benefits that could be unlocked if we had a
> more flexible executor model (imagine switching between different parts
> of the query whenever blocked on IO - can't do that due to the stack
> right now).
>

Yep, that's what I'm talking about being an issue.

Blocked on an index read? Move on to the next tuple and come back when the
index read is done.

I really like what I see of the io_uring architecture so far. It's ideal
for callback-based, event-driven flow control, but that doesn't fit the
postgres executor well. It's a better match for redo etc.

> The way it currently works is that things like sequential scans, vacuum,
> etc use a prefetching helper which will try to use AIO to read ahead of
> the next needed block. That helper uses callbacks to determine the next
> needed block, which e.g. vacuum uses to skip over all-visible/frozen
> blocks. There are plenty of other places that should use that helper,
> but we can already get considerably higher throughput for seqscans and
> vacuum, on both very fast local storage and high-latency cloud storage.
>
> Similarly, for writes there's a small helper to manage a write-queue of
> configurable depth, which currently is used by checkpointer and
> bgwriter (but should be used in more places). Especially with direct IO
> checkpointing can be a lot faster *and* less impactful on the "regular"
> load.
>

Sure sounds like a useful interim step. That's great.
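
If I've understood the shape of that read-ahead helper correctly, it's
roughly the following. The names are invented by me, not your actual
API, and start_async_read() is a hypothetical stand-in for the AIO
submission path:

    /* hypothetical AIO submission entry point */
    extern void start_async_read(long blkno);

    typedef struct PrefetchHelper
    {
        /* returns the next block to read, or -1 when done; vacuum's
         * callback would skip all-visible/frozen blocks here */
        long    (*next_block) (void *arg);
        void    *callback_arg;
        int      max_in_flight;     /* configurable read-ahead depth */
        int      in_flight;
    } PrefetchHelper;

    static void
    prefetch_pump(PrefetchHelper *ph)
    {
        while (ph->in_flight < ph->max_in_flight)
        {
            long    blkno = ph->next_block(ph->callback_arg);

            if (blkno < 0)
                break;
            /* issue the async read for blkno and count it; the
             * consumer decrements in_flight as completions arrive */
            start_async_read(blkno);
            ph->in_flight++;
        }
    }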

> I've got asynchronous writing of WAL mostly working, but need to
> redesign the locking a bit further. Right now it's a win in some cases,
> but not others. The latter to a significant degree due to unnecessary
> blocking....
>

That's where io_uring's I/O ordering operations looked interesting. But I
haven't looked closely enough to see if they're going to help us with I/O
ordering in a multiprocessing architecture like postgres.

In an ideal world we could tell the kernel about WAL-to-heap I/O
dependencies and even let it apply WAL then heap changes out-of-order so
long as they didn't violate any ordering constraints we specify between
particular WAL records or between WAL writes and their corresponding heap
blocks. But I don't know if the io_uring interface is that capable.
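
The specific primitive I had in mind is IOSQE_IO_LINK, which chains a
submission to the one before it so the kernel can't reorder across the
chain. A minimal liburing sketch of a WAL write, then WAL fsync, then
the dependent heap write (error handling elided; assumes a ring already
set up with io_uring_queue_init()):

    #include <liburing.h>

    /* If a linked op fails, its successors complete with -ECANCELED,
     * so the heap write can't land without the WAL fsync. */
    static int
    wal_then_heap(struct io_uring *ring,
                  int wal_fd, const void *wal_buf,
                  unsigned wal_len, off_t wal_off,
                  int heap_fd, const void *heap_buf,
                  unsigned heap_len, off_t heap_off)
    {
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, wal_fd, wal_buf, wal_len, wal_off);
        io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_fsync(sqe, wal_fd, IORING_FSYNC_DATASYNC);
        io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, heap_fd, heap_buf, heap_len, heap_off);

        io_uring_submit(ring);

        /* reap the three completions in order */
        for (int i = 0; i < 3; i++)
        {
            if (io_uring_wait_cqe(ring, &cqe) < 0)
                return -1;
            io_uring_cqe_seen(ring, cqe);
        }
        return 0;
    }

But a link chain is strictly linear within a submission stream; it
can't express the DAG of dependencies the WAL-vs-heap case really
wants, which is why I'm unsure it's enough for us.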

I did some basic experiments a while ago with using write barriers between
WAL records and heap writes instead of fsync()ing, but as you note, the
increased blocking and the reduction in the kernel's ability to reorder
I/O are generally worse than the cost of the fsync()s we do now.

> > I'm thinking that redo is probably a good first candidate. It doesn't
> > depend on the guts of the executor. It is much less sensitive to
> > ordering between operations in shmem and on disk since it runs in the
> > startup process. And it hurts REALLY BADLY from its single-threaded
> > blocking approach to I/O - as shown by an extension written by
> > 2ndQuadrant that can double redo performance by doing read-ahead on
> > btree pages that will soon be needed.
>
> Thomas has a patch for prefetching during WAL apply. It currently uses
> posix_fadvise(), but he took care that it'd be fairly easy to rebase it
> onto "real" AIO. Most of the changes necessary are pretty independent of
> posix_fadvise vs aio.
>

Cool. You know we worked on something like that in 2ndQ too, with
fast_redo, and it's pretty effective at reducing the I/O waits for b-tree
index maintenance.

How feasible do you think it'd be to take it a step further and structure
redo as a pipelined queue, where redo calls enqueue I/O operations and
completion handlers, then return immediately? Everything still goes to disk
in the order it's enqueued, and the callbacks will be invoked in order, so
they can update appropriate shmem state etc. Since there's no concurrency
during redo, it should be *much* simpler than normal user backend
operations where we have all the tight coordination of buffer management,
WAL write ordering, PGXACT and PGPROC, the clog, etc.

So far the main issue I see with it is that there are still way too many
places we'd have to block because of logic that requires the result of a
read in order to perform a subsequent write. We can't just turn those into
event-driven continuations on the queue and keep going unless we can
guarantee that the later WAL we apply while we're waiting is independent of
any changes the earlier pending writes might make and that's hard,
especially with b-trees. And it's those read-then-write ordering points
that hurt our redo performance the most already.
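
To make the idea concrete, the skeleton I have in mind is roughly this
(all names invented):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct RedoOp
    {
        /* completion handler: updates shmem state etc */
        void           (*on_complete) (struct RedoOp *op);
        bool             done;      /* set when the underlying I/O finishes */
        struct RedoOp   *next;
    } RedoOp;

    static RedoOp  *queue_head;     /* oldest submitted, not yet retired */
    static RedoOp  *queue_tail;

    static void
    redo_enqueue(RedoOp *op)
    {
        op->done = false;
        op->next = NULL;
        if (queue_tail)
            queue_tail->next = op;
        else
            queue_head = op;
        queue_tail = op;
        /* submit op's I/O asynchronously here, then return immediately */
    }

    static void
    redo_retire(void)
    {
        /* invoke handlers only from the head, so they run in enqueue
         * order even if the kernel completed later I/Os first */
        while (queue_head && queue_head->done)
        {
            RedoOp *op = queue_head;

            queue_head = op->next;
            if (queue_head == NULL)
                queue_tail = NULL;
            op->on_complete(op);
        }
    }

The read-then-write ordering points above are exactly where this breaks
down: redo would have to drain the whole queue before it could continue.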
