Streaming I/O, vectored I/O (WIP)

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Streaming I/O, vectored I/O (WIP)
Date: 2023-08-31 04:00:13
Message-ID: CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
Lists: pgsql-hackers

Hi,

Currently PostgreSQL reads (and writes) data files 8KB at a time.
That's because we call ReadBuffer() one block at a time, with no
opportunity for lower layers to do better than that. This thread is
about a model where you say which block you'll want next with a
callback, and then you pull the buffers out of a "stream". That way,
the streaming infrastructure can look as far into the future as it
wants, and then:

* systematically issue POSIX_FADV_WILLNEED for random access,
  replacing patchy ad hoc advice
* build larger vectored I/Os; e.g. one preadv() call can replace 16
  pread() calls

That's more efficient, and it goes faster. It's better even on
systems without 'advice' and/or vectored I/O support, because some
I/Os can be merged into wider simple pread/pwrite calls, and various
other small efficiencies come from batching.
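
To make the two bullet points above concrete, here is a minimal sketch of
what the lowest layer might do once it knows a run of consecutive blocks is
wanted. It is not code from the attached patches: the helper names
advise_blocks() and read_blocks_vectored() and the two constants are
assumptions made up for illustration. Only posix_fadvise() and preadv()
are real system interfaces.

```c
#define _DEFAULT_SOURCE			/* exposes preadv() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLOCK_SIZE 8192			/* PostgreSQL's default block size */
#define MAX_BLOCKS 16			/* 16 blocks = 128KB per I/O */

/*
 * Hypothetical helper: hint that nblocks blocks starting at first_block
 * will be read soon.  Returns 0 on success or an errno value; it is a
 * no-op on systems without POSIX_FADV_WILLNEED.
 */
static int
advise_blocks(int fd, unsigned first_block, int nblocks)
{
#ifdef POSIX_FADV_WILLNEED
	return posix_fadvise(fd, (off_t) first_block * BLOCK_SIZE,
						 (off_t) nblocks * BLOCK_SIZE, POSIX_FADV_WILLNEED);
#else
	(void) fd;
	(void) first_block;
	(void) nblocks;
	return 0;
#endif
}

/*
 * Hypothetical helper: read nblocks consecutive blocks with one system
 * call, scattering them into buffers that need not be adjacent in memory.
 * Returns the number of bytes read or -1 on error; a short read is
 * possible and a real caller would have to retry the remainder.
 */
static ssize_t
read_blocks_vectored(int fd, unsigned first_block, char *buffers[],
					 int nblocks)
{
	struct iovec iov[MAX_BLOCKS];

	assert(nblocks > 0 && nblocks <= MAX_BLOCKS);
	for (int i = 0; i < nblocks; i++)
	{
		iov[i].iov_base = buffers[i];
		iov[i].iov_len = BLOCK_SIZE;
	}
	return preadv(fd, iov, nblocks, (off_t) first_block * BLOCK_SIZE);
}
```

The scatter list matters because the destination buffers in a buffer pool
are generally not contiguous, so a single wide pread() is only possible
when they happen to be adjacent.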

The real goal, though, is to make it easier for later work to replace
the I/O subsystem with true asynchronous and concurrent I/O, as
required to get decent performance with direct I/O (and, at a wild
guess, the magic network smgr replacements that many of our colleagues
on this list work on). Client code such as access methods wouldn't
need to change again to benefit from that, as it would be fully
insulated by the streaming abstraction.

There are more kinds of streaming I/O that would be useful, such as
raw unbuffered files, and of course writes, and I've attached some
early incomplete demo code for writes (just for fun), but the main
idea I want to share in this thread is the idea of replacing lots of
ReadBuffer() calls with the streaming model. That's the thing with
the most potential users throughout the source tree and AMs, and I've
attached some work-in-progress examples of half a dozen use cases.
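
For readers who want a feel for the callback-and-pull model before opening
the patches, here is a toy sketch of its shape. Every name in it (Stream,
stream_init, stream_next, seq_scan_next_block and so on) is invented for
this illustration and is not the API in the attached patches. It returns
block numbers where real code would return pinned buffers, and it only
counts simulated I/Os instead of performing them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX	/* callback returns this at end of scan */
#define MAX_LOOKAHEAD 16			/* how far ahead the stream may look */

typedef uint32_t BlockNo;
typedef BlockNo (*NextBlockCallback) (void *callback_arg);

typedef struct Stream
{
	NextBlockCallback next_block;	/* client says which block comes next */
	void	   *callback_arg;
	BlockNo		queue[MAX_LOOKAHEAD];	/* circular queue of looked-up blocks */
	int			head;
	int			count;
	bool		exhausted;
	int			ios_issued;		/* simulated physical reads */
} Stream;

static void
stream_init(Stream *s, NextBlockCallback cb, void *arg)
{
	assert(cb != NULL);
	s->next_block = cb;
	s->callback_arg = arg;
	s->head = s->count = 0;
	s->exhausted = false;
	s->ios_issued = 0;
}

/* Fill the queue; each run of consecutive blocks counts as one I/O. */
static void
stream_look_ahead(Stream *s)
{
	BlockNo		prev = INVALID_BLOCK;

	while (!s->exhausted && s->count < MAX_LOOKAHEAD)
	{
		BlockNo		b = s->next_block(s->callback_arg);

		if (b == INVALID_BLOCK)
		{
			s->exhausted = true;
			break;
		}
		if (prev == INVALID_BLOCK || b != prev + 1)
			s->ios_issued++;	/* cannot merge with the previous block */
		s->queue[(s->head + s->count) % MAX_LOOKAHEAD] = b;
		s->count++;
		prev = b;
	}
}

/* Pull the next block out of the stream, or INVALID_BLOCK at the end. */
static BlockNo
stream_next(Stream *s)
{
	BlockNo		b;

	if (s->count == 0)
		stream_look_ahead(s);
	if (s->count == 0)
		return INVALID_BLOCK;
	b = s->queue[s->head];
	s->head = (s->head + 1) % MAX_LOOKAHEAD;
	s->count--;
	return b;
}

/* Example client: a sequential scan over nblocks blocks. */
typedef struct SeqScanState
{
	BlockNo		next;
	BlockNo		nblocks;
} SeqScanState;

static BlockNo
seq_scan_next_block(void *arg)
{
	SeqScanState *state = arg;

	if (state->next >= state->nblocks)
		return INVALID_BLOCK;
	return state->next++;
}
```

The client's loop body stays the same as with ReadBuffer(): it calls
stream_next() until the stream ends. In this toy version a 40-block
sequential scan is served by three merged reads rather than 40, and the
client never sees how the blocks were fetched.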

=== 1. Vectored I/O through the layers ===

* Provide vectored variants of FileRead() and FileWrite().
* Provide vectored variants of smgrread() and smgrwrite().
* Provide vectored variant of ReadBuffer().
* Provide multi-block smgrprefetch().

=== 2. Streaming read API ===

* Give SMgrRelation pointers a well-defined lifetime.
* Provide basic streaming read API.

=== 3. Example users of streaming read API ===

* Use streaming reads in pg_prewarm. [TM]
* WIP: Use streaming reads in heapam scans. [AF]
* WIP: Use streaming reads in vacuum. [AF]
* WIP: Use streaming reads in nbtree vacuum scan. [AF]
* WIP: Use streaming reads in bitmap heapscan. [MP]
* WIP: Use streaming reads in recovery. [TM]

=== 4. Some less developed work on vectored writes ===

* WIP: Provide vectored variant of FlushBuffer().
* WIP: Use vectored writes in checkpointer.

All of these are WIP; those marked WIP above are double-WIP. But
there's enough to demo the concept and discuss. Here are some
assorted notes:

* probably need to split block-count and I/O-count in stats system?
* streaming needs to "ramp up", instead of going straight to big reads
* the buffer pin limit is somewhat primitive
* more study of buffer pool correctness required
* 16 block/128KB size limit is not exactly arbitrary but not well
  researched (by me at least)
* various TODOs in user patches
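
As a rough illustration of the "ramp up" note in the list above: one
possible policy, and purely an assumption here since nothing has been
decided, is to start with single-block reads and double the read size after
each I/O until the 16-block limit is reached. The function names
next_distance() and blocks_fetched() are made up for this sketch, and it
ignores end-of-file.

```c
#include <assert.h>

#define MAX_DISTANCE 16			/* the 16 block/128KB limit */

/* Assumed policy: double the read size after every I/O, up to the cap. */
static int
next_distance(int distance)
{
	int			doubled;

	assert(distance >= 1 && distance <= MAX_DISTANCE);
	doubled = distance * 2;
	return doubled < MAX_DISTANCE ? doubled : MAX_DISTANCE;
}

/*
 * How many blocks get read if the consumer stops after 'wanted' blocks,
 * given the size of the first read?
 */
static int
blocks_fetched(int wanted, int initial_distance)
{
	int			fetched = 0;
	int			distance = initial_distance;

	while (fetched < wanted)
	{
		fetched += distance;
		distance = next_distance(distance);
	}
	return fetched;
}
```

Under this assumed policy a consumer that stops after one block costs one
block of I/O instead of 16, while a long scan reaches full-sized reads
after four I/Os (1, 2, 4, 8, then 16).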

A bit about where this code came from and how it relates to the "AIO"
project[1]: The idea and terminology 'streaming I/O' are due to
Andres Freund. This implementation of it is mine, and to keep this
mailing list fun, he hasn't reviewed it yet. The example user patches
are by Andres, Melanie Plageman and myself, and were cherry-picked
from the AIO branch, where they originally ran on top of Andres's
truly asynchronous 'streaming read', which is completely different
code. It has (or will have) exactly the same API, but it does much
more, with much more infrastructure. But the AIO branch is far too
much to propose at once.

We might have been a little influenced by a recent discussion on
pgsql-performance[2] that I could summarise as "why do you guys need
to do all this fancy AIO stuff, just give me bigger reads!". That was
actually a bit of a special case, I think (something is wrong with
btrfs's prefetch heuristics?), but in conversation we realised that
converting parts of PostgreSQL over to a stream-oriented model could
be done independently of AIO, and could offer some nice incremental
benefits already. So I worked on producing this code with an
identical API that just maps on to old-fashioned synchronous I/O
calls, except bigger and better.

The "example user" patches would be proposed separately in their own
threads after some more work, but I wanted to demonstrate the wide
applicability of this style of API in this preview. Some of these
make use of the ability to attach a bit of extra data to each buffer
-- see Melanie's bitmap heapscan patch, for example. In later
revisions I'll probably just pick one or two examples to work with for
a smaller core patch set, and then the rest can be developed
separately. (We thought about btree scans too as a nice high-value
area to tackle, but Tomas Vondra is hacking in that area and we didn't
want to step on his toes.)

[1] https://wiki.postgresql.org/wiki/AIO
[2] https://www.postgresql.org/message-id/flat/218fa2e0-bc58-e469-35dd-c5cb35906064%40gmx.net

Attachment                                                       Content-Type  Size
v1-0001-Provide-vectored-variants-of-FileRead-and-FileWri.patch  text/x-patch  5.3 KB
v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch  text/x-patch  21.0 KB
v1-0003-Provide-vectored-variant-of-ReadBuffer.patch             text/x-patch  27.7 KB
v1-0004-Provide-multi-block-smgrprefetch.patch                   text/x-patch  6.0 KB
v1-0005-Give-SMgrRelation-pointers-a-well-defined-lifetim.patch  text/x-patch  9.8 KB
v1-0006-Provide-basic-streaming-read-API.patch                   text/x-patch  18.6 KB
v1-0007-Use-streaming-reads-in-pg_prewarm.patch                  text/x-patch  2.9 KB
v1-0008-WIP-Use-streaming-reads-in-heapam-scans.patch            text/x-patch  13.5 KB
v1-0009-WIP-Use-streaming-reads-in-vacuum.patch                  text/x-patch  19.5 KB
v1-0010-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patch      text/x-patch  6.0 KB
v1-0011-WIP-Use-streaming-reads-in-bitmap-heapscan.patch         text/x-patch  51.2 KB
v1-0012-WIP-Use-streaming-reads-in-recovery.patch                text/x-patch  39.2 KB
v1-0013-WIP-Provide-vectored-variant-of-FlushBuffer.patch        text/x-patch  13.3 KB
v1-0014-WIP-Use-vector-writes-in-checkpointer.patch              text/x-patch  10.9 KB
