Re: Streaming I/O, vectored I/O (WIP)

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Streaming I/O, vectored I/O (WIP)
Date: 2024-01-11 03:19:48
Message-ID: CA+hUKG+PcmJXQUns-ZPcQ-MTH9Tv38dEj2infL0PRXj6fAVDrw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 11, 2024 at 8:58 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> On 10/01/2024 06:13, Thomas Munro wrote:
> > Bikeshedding call: I am open to better suggestions for the names
> > PrepareReadBuffer() and CompleteReadBuffers(), they seem a little
> > grammatically clumsy.
>
> How will these functions work in the brave new async I/O world? I assume
> PrepareReadBuffer() will initiate the async I/O, and
> CompleteReadBuffers() will wait for it to complete. How about
> StartReadBuffer() and WaitReadBuffer()? Or StartBufferRead() and
> WaitBufferRead()?

What we have imagined so far is that the asynchronous version would
probably have three steps, like this:

* PrepareReadBuffer() -> pins one buffer, reports if found or IO/zeroing needed
* StartReadBuffers() -> starts the I/O for n contiguous !found buffers
* CompleteReadBuffers() -> waits, completes as necessary
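
To make that concrete, here's a rough caller's-eye sketch of those three
steps. The signatures are hand-waved for illustration and don't exactly
match any real patch:

/*
 * Sketch of a caller driving the imagined three-step interface.
 * Signatures simplified/invented for illustration only.
 */
#include "postgres.h"
#include "storage/bufmgr.h"

#define SKETCH_MAX_RANGE 16		/* arbitrary limit for this example */

static void
read_range_sketch(BufferManagerRelation bmr, ForkNumber forknum,
				  BlockNumber first, int nblocks,
				  BufferAccessStrategy strategy)
{
	Buffer		buffers[SKETCH_MAX_RANGE];
	bool		found[SKETCH_MAX_RANGE];
	int			i;

	Assert(nblocks > 0 && nblocks <= SKETCH_MAX_RANGE);

	/* Step 1: pin every buffer, learning which ones still need I/O. */
	for (i = 0; i < nblocks; i++)
		buffers[i] = PrepareReadBuffer(bmr, forknum, first + i,
									   strategy, &found[i]);

	/* Step 2: start I/O for each contiguous run of !found buffers. */
	for (i = 0; i < nblocks;)
	{
		int			start = i;

		while (i < nblocks && !found[i])
			i++;
		if (i > start)
			StartReadBuffers(bmr, forknum, first + start,
							 &buffers[start], i - start);
		else
			i++;
	}

	/*
	 * Step 3: wait for (or, in the synchronous version, perform) the I/O,
	 * skipping over buffers that were already valid.
	 */
	CompleteReadBuffers(bmr, forknum, first, buffers, nblocks);
}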

In the proposed synchronous version, the middle step is missing, but
streaming_read.c directly calls smgrprefetch() instead. I thought
about shoving that inside a prosthetic StartReadBuffers() function,
but I backed out of simulating asynchronous I/O too fancifully. The
streaming read API is the layer where we really want to stabilise a
nice interface, so we can move things around behind it if required.

A bit of analysis of the one block -> nblocks change and the
synchronous -> asynchronous change:

Two things are new in a world with nblocks > 1. (1) We needed to be
able to set BM_IO_IN_PROGRESS on more than one block at a time, but
commit 12f3867f already provided that, and (2) someone else might come
along and read a block in the middle of our range, effectively
chopping our range into subranges. That's true also in master, but
when nblocks == 1 it's all-or-nothing, and now we have partial
cases. In the proposed synchronous code, CompleteReadBuffers() claims
as many contiguous BM_IO_IN_PROGRESS flags as it can in the range, and
then loops to process the rest, skipping over any blocks that are already
done. Further down in md.c, you might also cross a segment boundary.
So those are two different reasons why a single call to
CompleteReadBuffers() might finish up generating zero or more than one
I/O system call, though very often it's one.
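
In pseudo-C, the loop structure is roughly this, with invented helper
names standing in for the real bufmgr/md.c machinery:

/* Invented helpers standing in for the real bufmgr/md.c machinery. */
static bool buffer_is_valid_already(Buffer buf);
static int	claim_contiguous_io(Buffer *buffers, int nblocks);
static void read_one_range(Buffer *buffers, BlockNumber first, int nblocks);

/* Very simplified sketch of CompleteReadBuffers()'s outer loop. */
static void
complete_read_buffers_sketch(Buffer *buffers, BlockNumber first, int nblocks)
{
	int			i = 0;

	while (i < nblocks)
	{
		int			nclaimed;

		/* Skip any block another backend has already read in. */
		if (buffer_is_valid_already(buffers[i]))
		{
			i++;
			continue;
		}

		/*
		 * Claim BM_IO_IN_PROGRESS for as many contiguous not-yet-valid
		 * blocks as we can, starting here.
		 */
		nclaimed = claim_contiguous_io(&buffers[i], nblocks - i);

		/*
		 * Issue the read; md.c may still split this into more than one
		 * system call if the range crosses a segment boundary.
		 */
		read_one_range(&buffers[i], first + i, nclaimed);

		i += nclaimed;
	}
}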

Hmm, while spelling that out, I noticed an obvious problem and
improvement to make to that part of v5. If backend #1 is trying to
read blocks 101..103 and acquires BM_IO_IN_PROGRESS for 101, but
backend #2 comes along and starts reading block 102 first, backend
#1's StartBufferIO() call would wait for 102's I/O CV while it still
holds BM_IO_IN_PROGRESS for block 101, potentially blocking a third
backend #3 that wants to read block 101 even though no I/O is in
progress for that block yet! At least that's deadlock-free (because
we always acquire in block order), but it seems like undesirable lock
chaining.
Here is my proposed improvement: StartBufferIO() gains a nowait flag.
For the head block we wait, but while trying to build a larger range
we don't. We'll try 102 again in the next loop, with a wait. Here is
a small fixup for that.
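
Sketched out, that range-building step (roughly the claim_contiguous_io()
part of the earlier sketch) would look something like this; the extra
nowait argument to StartBufferIO() is the proposed change, and the rest is
simplified:

/* Simplified sketch of range-building with the proposed nowait flag. */
static int
claim_contiguous_io_sketch(Buffer *buffers, int nblocks)
{
	int			nclaimed = 0;

	for (int i = 0; i < nblocks; i++)
	{
		/*
		 * Wait for the head block if necessary, but don't wait for later
		 * blocks: otherwise we'd sit on the head block's BM_IO_IN_PROGRESS
		 * while waiting on somebody else's I/O for a later block.
		 */
		bool		nowait = (i > 0);

		if (!StartBufferIO(GetBufferDescriptor(buffers[i] - 1), true, nowait))
			break;				/* stop the range here; a later block will be
								 * retried, with a wait, on the next trip
								 * through the outer loop */
		nclaimed++;
	}

	return nclaimed;
}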

In an asynchronous version, that BM_IO_IN_PROGRESS negotiation would
take place in StartReadBuffers() instead, which would be responsible
for kicking off asynchronous I/Os (= asking a background worker to
call pread(), or equivalent fancy kernel async APIs). One question is
what it does if it finds a block in the middle that chops the read up,
or for that matter a segment boundary. I don't think we have to
decide now, but the two options seem to be that it starts one single
I/O and reports its size, making it the client's problem to call again
with the rest, or that it starts more than one I/O and they are
somehow chained together so that the caller doesn't know about that
and can later wait for all of them to finish using just one <handle
thing>.

(The reason why I'm not 100% certain how it will look is that the
real, working code in the aio branch right now doesn't actually expose
a vector/nblocks bufmgr interface at all, yet. Andres's original
prototype had a single-block Start*(), Complete*() design, but a lower
level of the AIO system notices if pending read operations are
adjacent and could be merged. While discussing all this we decided it
was a bit strange to have lower code deal with allocating, chaining
and processing lots of separate I/O objects in shared memory, when
higher level code could often work in bigger ranges up front, and then
interact with the AIO subsystem with many fewer objects and steps.
Also, the present simple and lightweight synchronous proposal
lacks the whole subsystem that could do that by magic.)

> About the signature of those functions: Does it make sense for
> CompleteReadBuffers() (or WaitReadBuffers()) function to take a vector
> of buffers? If StartReadBuffer() initiates the async I/O immediately, is
> there any benefit to batching the waiting?
>
> If StartReadBuffer() starts the async I/O, the idea that you can call
> ZeroBuffer() instead of WaitReadBuffer() doesn't work. I think
> StartReadBuffer() needs to take ReadBufferMode, and do the zeroing for
> you in RBM_ZERO_* modes.

Yeah, good thoughts, and topics that have occupied me for some time
now. I also thought that StartReadBuffer() should take
ReadBufferMode, but I came around to the idea that it probably
shouldn't, for the following reasons:

I started working on all this by trying to implement the most
complicated case I could imagine, streaming recovery, and then working
back to the easy cases that just do scans with RBM_NORMAL. In
recovery, we can predict that a block will be zeroed using WAL flags,
and there are pre-existing cross-checks at redo time that enforce that
the flags and redo code agree on that, but we can't predict which
exact ReadBufferMode the redo code will use, RBM_ZERO_AND_LOCK or
RBM_ZERO_AND_CLEANUP_LOCK (or mode=RBM_NORMAL and
get_cleanup_lock=true, as the comment warns them not to, but I
digress).

That's OK, because we can't take locks while looking ahead in recovery
anyway (the redo routine carefully controls lock order/protocol), so
the code to actually do the locking needs to be somewhere near the
output end of the stream when the redo code calls
XLogReadBufferForRedoExtended(). But if you try to use RBM_XXX in
these interfaces, it begins to look pretty funny: the streaming
callback needs to be able to say which ReadBufferMode to use, but
anywhere near Prepare*(), Start*() or even Complete*() is too soon.
So maybe we need to invent a new value RBM_WILL_ZERO that doesn't yet
say which of the zero modes to use, and then the redo routine needs to
pass the real RBM_ZERO_AND_{LOCK,CLEANUP_LOCK} value to
XLogReadBufferForRedoExtended(), which does the zeroing and locking in
a separate step anyway, so the ReadBufferMode is effectively ignored.
That feels just wrong -- we'd be using RBM values but implementing
them only partially.

Another way to put it is that ReadBufferMode actually conflates a
bunch of different behaviour that applies at different times that we
are now separating, and recovery reveals this most clearly because it
doesn't have all the information needed while looking ahead. It might
be possible to shove more information into the WAL to fix the
information problem, but it seemed more natural to me to separate the
aspects of ReadBufferMode, because that isn't the only problem:
group-wise processing of locking doesn't even make sense.
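
For reference, here's the enum from include/storage/bufmgr.h, with the
comments paraphrased by me to note when each behaviour actually applies:

/*
 * From include/storage/bufmgr.h (comments paraphrased to note when each
 * behaviour takes effect).
 */
typedef enum
{
	RBM_NORMAL,					/* plain read; nothing extra at any stage */
	RBM_ZERO_AND_LOCK,			/* don't read at all (decided up front), and
								 * lock the page (applies only after the
								 * buffer is handed back) */
	RBM_ZERO_AND_CLEANUP_LOCK,	/* same, but take a cleanup lock */
	RBM_ZERO_ON_ERROR,			/* do read, but zero the page instead of
								 * erroring out on failure (applies at I/O
								 * completion time) */
	RBM_NORMAL_NO_LOG			/* like RBM_NORMAL, but don't log invalid
								 * pages during WAL replay */
} ReadBufferMode;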

So I teased a couple of those aspects out into separate flags, for example:

The streaming read interface has two variants: the "simple" implicit
RBM_NORMAL, single relation, single fork version that is used in most
client examples and probably everything involving the executor, and
the "extended" version (like the earlier versions in this thread,
removed for now based on complaints that most early uses don't use it,
will bring it back separately later with streaming recovery patches).
In the extended version, the streaming callback can set *will_zero =
true, which is about all the info that recovery can figure out from
the WAL anyway, and then XLogReadBufferForRedoExtended() will later
call ZeroBuffer() because at that time we have the ReadBufferMode.
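
For illustration, an extended-variant callback for recovery might look
something like this; the callback signature and the helpers are made up
here, since the extended API isn't in the current patch set:

/* Hypothetical extended-variant callback; signature and helpers invented. */
#include "access/xlogreader.h"

static bool next_block_will_init(XLogReaderState *record);		/* invented */
static BlockNumber next_block_number(XLogReaderState *record);	/* invented */

static BlockNumber
recovery_stream_next_sketch(void *callback_private, bool *will_zero)
{
	XLogReaderState *record = (XLogReaderState *) callback_private;

	/*
	 * All we can tell from the WAL while looking ahead is whether the block
	 * will be (re)initialized; the redo routine itself later chooses between
	 * RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK when it calls
	 * XLogReadBufferForRedoExtended().
	 */
	*will_zero = next_block_will_init(record);

	return next_block_number(record);
}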

The _ZERO_ON_ERROR aspect is a case where CompleteReadBuffers() is the
right time to act and where processing as a batch makes sense, so it
becomes a flag.

> Putting all that together, I propose:
>
> /*
> * Initiate reading a block from disk to the buffer cache.
> *
> * XXX: Until we have async I/O, this just allocates the buffer in the
> buffer
> * cache. The actual I/O happens in WaitReadBuffer().
> */
> Buffer
> StartReadBuffer(BufferManagerRelation bmr,
> ForkNumber forkNum,
> BlockNumber blockNum,
> BufferAccessStrategy strategy,
> ReadBufferMode mode,
> bool *foundPtr);
>
> /*
> * Wait for a read that was started earlier with StartReadBuffer() to
> finish.
> *
> * XXX: Until we have async I/O, this is the function that actually
> performs
> * the I/O. If multiple I/Os have been started with StartReadBuffer(), this
> * will try to perform all of them in one syscall. Subsequent calls to
> * WaitReadBuffer(), for those other buffers, will finish quickly.
> */
> void
> WaitReadBuffer(Buffer buf);

I'm confused about where the extra state lives that would allow the
communication required to build a larger I/O. In the AIO branch, it
does look a little more like that, but there is more magic state and
machinery hiding behind the curtain: the backend's pending I/O list
builds up a chain of I/Os, and when you try to wait, if it hasn't
already been submitted to the kernel/bgworkers yet it will be, and
before that merging will happen. So you get bigger I/Os without
having to say so explicitly.

For this synchronous version (and hopefully soon a more efficient
improved version in the AIO branch), we want to take advantage of the
client's pre-existing and more cheaply obtained knowledge of ranges.

> I'm not sure how well this fits with the streaming read API. The
> streaming read code performs grouping of adjacent blocks to one
> CompleteReadBuffers() call. If WaitReadBuffer() does the batching,
> that's not really required. But does that make sense with async I/O?

> With async I/O, will you need a vectorized version of StartReadBuffer() too?

I think so, yes.

Attachment Content-Type Size
0001-fixup-Provide-vectored-variant-of-ReadBuffer.txt text/plain 4.6 KB
