Re: Asynchronous and "direct" IO support for PostgreSQL.

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Asynchronous and "direct" IO support for PostgreSQL.
Date: 2021-02-24 06:19:00
Message-ID: CA+hUKGK-563RQWQQF4NLajbQk+65gYHdb1q=7p3Ob0Uvrxoa9g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 23, 2021 at 11:03 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> over the last ~year I spent a lot of time trying to figure out how we could
> add AIO (asynchronous IO) and DIO (direct IO) support to postgres. While
> there's still a *lot* of open questions, I think I now have a decent handle on
> most of the bigger architectural questions. Thus this long email.

Hello,

Very cool to see this project escaping onto -hackers!

I have done some work on a couple of low level parts of it, and I
wanted to show a quick "hey, where'd my system calls go?" demo, which
might help illustrate some very simple things about this stuff. Even
though io_uring is the new hotness in systems programming, I'm going
to use io_method=worker here. It's the default in the current patch
set, it works on all our supported OSes, and it's easier to understand
without knowledge of shiny new or obscure old AIO system interfaces.
I'll also use io_workers=1, an artificially low setting to make it
easy to spy on (pseudo) async I/O with strace/truss/dtruss on a single
process, and max_parallel_workers_per_gather=0 to keep executor
parallelism from confusing matters.
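
Concretely, the demo setup amounts to no more than this (io_method and
io_workers are GUCs from the patch set as described above; treat the
exact lines below as a sketch rather than gospel):

  io_method = worker                     # pseudo async I/O via worker processes
  io_workers = 1                         # artificially low, to make spying easy
  max_parallel_workers_per_gather = 0    # keep executor parallelism out of it

  $ strace -p <pid of the io worker>     # or truss/dtruss on other OSes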

The first thing to notice is that there's an "io worker" process, and
while filling up a big table with "insert into t select
generate_series(1, 100000000)", it's doing a constant stream of 128KB
pwritev() calls. These are writing out 16 blocks from shared buffers
at a time:

pwritev(44, [{iov_base=..., iov_len=73728},
{iov_base=..., iov_len=24576},
{iov_base=..., iov_len=32768}], 3, 228032512) = 131072

The reason there are 3 vectors there rather than 16 is just that some
of the buffers happened to be adjacent in memory, so we might as well
use the smallest number of vectors (here 9 + 3 + 4 blocks of 8KB = 16
blocks = 131072 bytes). Just after we've started up and the buffer pool
is empty, it's easy to find big single-vector I/Os, but things soon get
more fragmented (blocks adjacent on disk become less likely to be
adjacent in shared buffers) and that number goes up; that shouldn't
make much difference to the OS or hardware, though, assuming decent
scatter/gather support through the stack. If io_data_direct=on
(not the default) and the blocks are in one physical extent on the
file system, that might even go all the way down to the disk as a
single multi-segment write command for the storage hardware DMA engine
to beam directly in/out of our buffer pool without CPU involvement.
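
If you want to picture the merging step, conceptually it's no more than
this little toy (not the patch's actual code; BLCKSZ is PostgreSQL's
8KB default block size):

  #include <stddef.h>
  #include <sys/uio.h>

  #define BLCKSZ 8192

  /*
   * Toy version of the idea: given the shared-buffer pages that make up
   * one combined I/O, in file order, collapse memory-adjacent pages into
   * a single iovec so that the smallest possible vector is handed to
   * preadv()/pwritev().
   */
  static int
  build_iovec(char *pages[], int npages, struct iovec *iov, int max_iov)
  {
      int         niov = 0;

      for (int i = 0; i < npages; i++)
      {
          if (niov > 0 &&
              (char *) iov[niov - 1].iov_base + iov[niov - 1].iov_len == pages[i])
              iov[niov - 1].iov_len += BLCKSZ;   /* adjacent in memory: extend */
          else
          {
              if (niov == max_iov)
                  break;                         /* real code would split the I/O */
              iov[niov].iov_base = pages[i];
              iov[niov].iov_len = BLCKSZ;
              niov++;
          }
      }
      return niov;                               /* vector count for preadv/pwritev */
  }

With 16 pages that happen to form three contiguous runs in shared
buffers, you get exactly the 3-vector call shown above.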

Mixed into that stream of I/O worker system calls, you'll also see WAL
going out to disk:

pwritev(15, [{iov_base=..., iov_len=1048576}], 1, 4194304) = 1048576

Meanwhile, the user session process running the big INSERT can be seen
signalling the I/O worker to wake it up. The same thing happens for
bgwriter, checkpointer, autovacuum and walwriter: you can see them all
handing off most of their I/O work to the pool of I/O workers, with a
bit of new signalling going on (which we try to minimise, and can
probably minimise much more). (You might be able to see some evidence
of Andres's new buffer cleaning scheme too, which avoids some bad
patterns of interleaving small reads and writes, but I'm skipping right
over that here...)

Working through a very simple example of how the I/O comes to be
consolidated and parallelised, let's look at a simple non-parallel
SELECT COUNT(*) query on a large table. The I/O worker does a stream
of scattered reads into our buffer pool:

preadv(51, [{iov_base=..., iov_len=24576},
{iov_base=..., iov_len=8192},
{iov_base=..., iov_len=16384},
{iov_base=..., iov_len=16384},
{iov_base=..., iov_len=16384},
{iov_base=..., iov_len=49152}], 6, 190808064) = 131072

Meanwhile our user session backend can be seen waking it up whenever
it's trying to start I/O and finds it snoozing:

kill(1803835, SIGURG) = 0
kill(1803835, SIGURG) = 0
kill(1803835, SIGURG) = 0
kill(1803835, SIGURG) = 0
kill(1803835, SIGURG) = 0

Notice that there are no sleeping system calls in the query backend,
meaning the I/O in this example is always finished by the time the
executor gets around to accessing the page it requested, so we're
staying far enough ahead and we can be 100% CPU bound. In unpatched
PostgreSQL we'd hope to have no actual sleeping in such a simple case
anyway, thanks to the OS's readahead heuristics; but (1) we'd still do
individual pread(8KB) calls, so the user's query would at least pay the
CPU cost of a round trip into the kernel and a copyout of 8KB from
kernel space to user space for every block, both avoided here; (2) in
io_data_direct=on mode, there's no page cache and thus no kernel
readahead, so we need to replace that mechanism with something anyway;
and (3) our own readahead is needed for non-sequential access like
btree scans in any case.

Sometimes I/Os are still run in user backends, for example because (1)
existing non-AIO code paths are still reached, (2) in worker mode, some
kinds of I/Os can't be handed off to another process for lack of a way
to open some fds, or because we're in single process mode, or (3) a
heuristic kicks in when we know there's only one I/O to run and that
we'll immediately wait for it, so we can skip a lot of communication by
just making a traditional synchronous syscall (worker mode only for
now; this still needs to be done for the other modes).

In order to be able to generate a stream of big vectored reads/writes,
and start them far enough ahead of time that they're finished before
we need the data, there are several layers of new infrastructure that
Andres already mentioned and can explain far better than I, but super
briefly:

heapam.c uses a "pg_streaming_read" object (aio_util.c) to get buffers
to scan, instead of directly calling ReadBuffer(). It gives the
pg_streaming_read a callback of its own, so that heapam.c remains in
control of what is read, but the pg_streaming_read is in control of
readahead distance and also "submission". heapam.c's callback calls
ReadBufferAsync() to initiate reads of pages that it will need soon,
which it does with pgaio_io_start_read_sb() if there's a cache miss.
This results in 8KB reads queued up in the process's pending I/O list,
with pgaio_read_sb_complete as the completion function to run when
each read has eventually completed. When the pending list grows to a
certain length, it is submitted by pg_streaming_read code. That
involves first "combining" pending I/Os: this is where reads/writes of
adjacent ranges of the same file are merged into larger I/Os, up to a
limit.
Then the I/Os are submitted to the OS, and we'll eventually learn
about their completion, via io_method-specific means (see
aio_worker.c, aio_uring.c, aio_posix.c and one day probably also
aio_win32.c). At that point, merged I/Os will be uncombined.
Skipping over some complications about retrying after certain kinds of
failure/partial I/O, that leads to ReadBufferCompleteWrite() being
called for each buffer. (Far be it from me to try to explain, at this
stage, the rather complex interlocking required to deal with pins and
locks between ReadBufferAsync() and ReadBufferCompleteWrite() in
(potentially) another process while the I/O is in progress.)
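
To make that division of labour a little more concrete, here is the
rough *shape* of a consumer. pg_streaming_read and ReadBufferAsync()
are named above, but the exact identifiers and signatures below
(pg_streaming_read_alloc() and friends, the callback shape,
ReadBufferAsync()'s arguments) are guesses for illustration; the real
interface in aio_util.c differs in detail:

  typedef struct MyScanState
  {
      Relation    rel;
      BlockNumber next_block;
      BlockNumber nblocks;
  } MyScanState;

  /*
   * Lookahead callback: the streaming-read machinery decides when and
   * how far to read ahead; this callback decides what to read.  On a
   * cache miss the read is queued on the backend's pending I/O list for
   * later combining and submission.
   */
  static bool
  my_scan_readahead(void *arg)
  {
      MyScanState *scan = (MyScanState *) arg;

      if (scan->next_block >= scan->nblocks)
          return false;           /* nothing left to initiate */
      ReadBufferAsync(scan->rel, MAIN_FORKNUM, scan->next_block++);
      return true;
  }

  /*
   * Consumer loop: pull buffers back in order.  By the time we get one,
   * its (possibly merged) I/O has usually already completed.
   */
  static void
  my_scan(MyScanState *scan)
  {
      PgStreamingRead *sr = pg_streaming_read_alloc(my_scan_readahead, scan);
      Buffer      buf;

      while ((buf = pg_streaming_read_get_next(sr)) != InvalidBuffer)
      {
          /* ... inspect the page, count tuples, etc. ... */
          ReleaseBuffer(buf);
      }
      pg_streaming_read_free(sr);
  }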

Places in the tree that want to do carefully controlled I/O depth
management can consider using pg_streaming_{read,write}, providing
their own callback to do the real work (though it's not necessary, and
not all AIO uses suit the "streaming" model). There's also the
traditional PrefetchBuffer() mechanism, which can still be used to
initiate buffer reads as before. It's comparatively primitive; since
you don't know when the I/O completes, you have to use conservative
models as I do in my proposed WAL prefetching patch. That patch (like
probably many others like CF #2799) works just fine on top of the AIO
branch, with some small tweaks: it happily shifts all I/O system calls
out of the recovery process, so that instead of calling
posix_fadvise() and then a bit later pread() for each cold page
accessed, it makes one submission system call for every N cold pages
(or, in some cases, no system calls at all). A future better
integration would probably use pg_streaming_read for precise control
of the I/O depth instead of the little LSN queue it currently uses,
but I haven't tried to write that yet.
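
For contrast, the conservative fixed-distance style that
PrefetchBuffer() forces on you looks roughly like this
(PrefetchBuffer(), ReadBuffer(), ReleaseBuffer() and MAIN_FORKNUM are
existing APIs; the distance and the loop are just an illustration):

  #define PREFETCH_DISTANCE 32    /* blocks; an arbitrary, conservative guess */

  static void
  scan_with_prefetch(Relation rel, BlockNumber nblocks)
  {
      for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
      {
          /*
           * Fire-and-forget hint a fixed distance ahead: we never learn
           * when (or whether) the prefetched I/O finished, so we can
           * only hope the distance was big enough.
           */
          if (blkno + PREFETCH_DISTANCE < nblocks)
              PrefetchBuffer(rel, MAIN_FORKNUM, blkno + PREFETCH_DISTANCE);

          Buffer      buf = ReadBuffer(rel, blkno);

          /* ... process the page ... */
          ReleaseBuffer(buf);
      }
  }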

If you do simple large INSERTs and SELECTs with one of the native
io_method settings instead of worker mode, it'd be much the same, in
terms of most of the architecture. The information in the pg_stat_XXX
views is almost exactly the same. There are two major differences:
(1) the other methods have no I/O worker processes, because the kernel
manages the I/O (or in some unfortunate cases runtime libraries fake
it with threads), (2) the "shared completion callbacks" (see
aio_scb.c) are run by I/O workers in worker mode, but are run by
whichever process "drains" the I/O in the other modes. That is, in
worker mode, initiating processes never hear about I/O completions from
the operating system; they just eventually wait on them and find that
they're already completed (though they do run the "local callback" if
there is one, which is, for example, the point at which
pg_streaming_read might initiate more I/O), or alternatively see that
they aren't completed yet, and wait on a condition variable for an I/O
worker to signal completion. So far this seems like a good choice...

Hope that helped show off a couple of features of this scheme.
