Re: Streaming I/O, vectored I/O (WIP)

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: Streaming I/O, vectored I/O (WIP)
Date: 2023-09-27 18:33:15
Message-ID: b336fbe8-1ad0-c289-5ed8-bbb74513e5bb@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 31/08/2023 07:00, Thomas Munro wrote:
> Currently PostgreSQL reads (and writes) data files 8KB at a time.
> That's because we call ReadBuffer() one block at a time, with no
> opportunity for lower layers to do better than that. This thread is
> about a model where you say which block you'll want next with a
> callback, and then you pull the buffers out of a "stream".

I love this idea! Makes it a lot easier to perform prefetch, as
evidenced by the 011-WIP-Use-streaming-reads-in-bitmap-heapscan.patch:

13 files changed, 289 insertions(+), 637 deletions(-)

I'm a bit disappointed and surprised by
v1-0009-WIP-Use-streaming-reads-in-vacuum.patch though:

4 files changed, 244 insertions(+), 78 deletions(-)

The current prefetching logic in vacuumlazy.c is pretty hairy, so I
hoped that this would simplify it. I didn't look closely at that patch,
so maybe it's simpler even though it's more code.

> There are more kinds of streaming I/O that would be useful, such as
> raw unbuffered files, and of course writes, and I've attached some
> early incomplete demo code for writes (just for fun), but the main
> idea I want to share in this thread is the idea of replacing lots of
> ReadBuffer() calls with the streaming model.

All this makes sense. Some random comments on the patches:

> + /* Avoid a slightly more expensive kernel call if there is no benefit. */
> + if (iovcnt == 1)
> + returnCode = pg_pread(vfdP->fd,
> + iov[0].iov_base,
> + iov[0].iov_len,
> + offset);
> + else
> + returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);

How about pushing down this optimization to pg_preadv() itself?
pg_readv() is currently just a macro if the system provides preadv(),
but it could be a "static inline" that does the above dance. I think
that optimization is platform-dependent anyway, pread() might not be any
faster on some OSs. In particular, if the system doesn't provide
preadv() and we use the implementation in src/port/preadv.c, it's the
same kernel call anyway.

> v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch

No smgrextendv()? I guess none of the patches here needed it.

> /*
> * Prepare to read a block. The buffer is pinned. If this is a 'hit', then
> * the returned buffer can be used immediately. Otherwise, a physical read
> * should be completed with CompleteReadBuffers(). PrepareReadBuffer()
> * followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the
> * caller has the opportunity to coalesce reads of neighboring blocks into one
> * CompleteReadBuffers() call.
> *
> * *found is set to true for a hit, and false for a miss.
> *
> * *allocated is set to true for a miss that allocates a buffer for the first
> * time. If there are multiple calls to PrepareReadBuffer() for the same
> * block before CompleteReadBuffers() or ReadBuffer_common() finishes the
> * read, then only the first such call will receive *allocated == true, which
> * the caller might use to issue just one prefetch hint.
> */
> Buffer
> PrepareReadBuffer(BufferManagerRelation bmr,
> ForkNumber forkNum,
> BlockNumber blockNum,
> BufferAccessStrategy strategy,
> bool *found,
> bool *allocated)
>

If you decide you don't want to perform the read, after all, is there a
way to abort it without calling CompleteReadBuffers()? Looking at the
later patch that introduces the streaming read API, seems that it
finishes all the reads, so I suppose we don't need an abort function.
Does it all get cleaned up correctly on error?

> /*
> * Convert an array of buffer address into an array of iovec objects, and
> * return the number that were required. 'iov' must have enough space for up
> * to PG_IOV_MAX elements.
> */
> static int
> buffers_to_iov(struct iovec *iov, void **buffers, int nblocks)
The comment is a bit inaccurate. There's an assertion that If nblocks
<= PG_IOV_MAX, so while it's true that 'iov' must have enough space for
up to PG_IOV_MAX elements, that's only because we also assume that
nblocks <= PG_IOV_MAX.

I don't see anything in the callers (mdreadv() and mdwritev()) to
prevent them from passing nblocks > PG_IOV_MAX.

in streaming_read.h:

> typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
> uintptr_t pgsr_private,
> void *per_io_private,
> BufferManagerRelation *bmr,
> ForkNumber *forkNum,
> BlockNumber *blockNum,
> ReadBufferMode *mode);

I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
each read. I see that you used that in the WAL replay prefetching, so I
guess that makes sense.

> extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
> extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_io_private);
> extern void pg_streaming_read_reset(PgStreamingRead *pgsr);
> extern void pg_streaming_read_free(PgStreamingRead *pgsr);

Do we need to expose pg_streaming_read_prefetch()? It's only used in the
WAL replay prefetching patch, and only after calling
pg_streaming_read_reset(). Could pg_streaming_read_reset() call
pg_streaming_read_prefetch() directly? Is there any need to "reset" the
stream, without also starting prefetching?

In v1-0012-WIP-Use-streaming-reads-in-recovery.patch:

> @@ -1978,6 +1979,9 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
> * If the WAL record contains a block reference with the given ID, *rlocator,
> * *forknum, *blknum and *prefetch_buffer are filled in (if not NULL), and
> * returns true. Otherwise returns false.
> + *
> + * If prefetch_buffer is not NULL, the buffer is already pinned, and ownership
> + * of the pin is transferred to the caller.
> */
> bool
> XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id,
> @@ -1998,7 +2002,15 @@ XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id,
> if (blknum)
> *blknum = bkpb->blkno;
> if (prefetch_buffer)
> + {
> *prefetch_buffer = bkpb->prefetch_buffer;
> +
> + /*
> + * Clear this flag is so that we can assert that redo records take
> + * ownership of all buffers pinned by xlogprefetcher.c.
> + */
> + bkpb->prefetch_buffer = InvalidBuffer;
> + }
> return true;
> }

Could these changes be committed independently of all the other changes?

--
Heikki Linnakangas
Neon (https://neon.tech)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Melanie Plageman 2023-09-27 18:36:23 Re: Eager page freeze criteria clarification
Previous Message Peter Geoghegan 2023-09-27 18:23:47 Re: Eager page freeze criteria clarification