Re: Streaming I/O, vectored I/O (WIP)

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: Streaming I/O, vectored I/O (WIP)
Date: 2024-03-28 20:45:08
Message-ID: 98f01395-fc60-4d50-89c9-fc3011fa2a45@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> I spent most of the past few days trying to regain some lost
> performance. Thanks to Andres for some key observations and help!
> That began with reports from Bilal and Melanie (possibly related to
> things Tomas had seen too, not sure) of regressions in all-cached
> workloads, which I already improved a bit with the ABC algorithm that
> minimised pinning for this case. That is, if there's no recent I/O so
> we reach what I call behaviour A, it should try to do as little magic
> as possible. But it turns out that wasn't enough! It is very hard to
> beat a tight loop that just does ReadBuffer(), ReleaseBuffer() over
> millions of already-cached blocks, if you have to do exactly the same
> work AND extra instructions for management.

I got a little nerd-sniped by that, and did some micro-benchmarking of
my own. I tested essentially this, with small values of 'nblocks' so
that all pages are in cache:

for (int i = 0; i < niters; i++)
{
for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
{
buf = ReadBuffer(rel, blkno);
ReleaseBuffer(buf);
}
}

The results look like this (lower is better, test program and script
attached):

master (213c959a29): 8.0 s
streaming-api v13: 9.5 s

This test exercises just the ReadBuffer() codepath, to check if there is
a regression there. It does not exercise the new streaming APIs.

So looks like the streaming API patches add some overhead to the simple
non-streaming ReadBuffer() case. This is a highly narrow
micro-benchmark, of course, so even though this is a very
performance-sensitive codepath, we could perhaps accept a small
regression there. In any real workload, you'd at least need to take the
buffer lock and read something from the page.

But can we do better? Aside from performance, I was never quite happy
with the BMR_REL/BMR_SMGR stuff we introduced in PG v16. I like having
one common struct like BufferManagerRelation that is used in all the
functions, instead of having separate Relation and SMgrRelation variants
of every function. But those macros feel a bit hacky and we are not
consistently using them in all the functions. Why is there no
ReadBuffer() variant that takes a BufferManagerRelation?

The attached patch expands the use of BufferManagerRelations. The
principle now is that before calling any bufmgr function, you first
initialize a BufferManagerRelation struct, and pass that to the
function. The initialization is done by the InitBMRForRel() or
InitBMRForSmgr() function, which replace the BMR_REL/BMR_SMGR macros.
They are full-blown functions now because they do more setup upfront
than BMR_REL/BMR_SMGR. For example, InitBMRForRel() always initializes
the 'smgr' field, so that you don't need to repeat this pattern in all
the other functions:

- /* Make sure our bmr's smgr and persistent are populated. */
- if (bmr.smgr == NULL)
- {
- bmr.smgr = RelationGetSmgr(bmr.rel);
- bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
- }

Initializing the BufferManagerRelation is still pretty cheap, so it's
feasible to call it separately for every ReadBuffer() call. But you can
also reuse it across calls, if you read multiple pages in a loop, for
example. That saves a few cycles.

The microbenchmark results with these changes:

master (213c959a29): 8.0 s
streaming-api v13: 9.5 s
bmr-refactor 8.4 s
bmr-refactor, InitBMR once 7.7 s

The difference between the "bmr-refactor" and "initBMR once" is that in
the "initBMR once" test, I modified the benchmark to call
InitBMRForRel() just once, outside the loop. So that shows the benefit
of reusing the BufferManagerRelation. This refactoring seems to make
performance regression smaller, even if you don't take advantage of
reusing the BufferManagerRelation.

This also moves things around a little in ReadBuffer_common() (now
called ReadBufferBMR). Instead of calling StartReadBuffer(), it calls
PinBufferForBlock() directly. I tried doing that before the other
refactorings, but that alone didn't seem to make much difference. Not
sure if it's needed, it's perhaps an orthogonal refactoring, but it's
included here nevertheless.

What do you think? The first three attached patches are your v13 patches
unchanged. The fourth is the micro-benchmark I used. The last patch is
the interesting one.

PS. To be clear, I'm happy with your v13 streaming patch set as it is. I
don't think this BufferManagerRelation refactoring is a show-stopper.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachment Content-Type Size
v13-bmr-0001-Provide-vectored-variant-of-ReadBuffer.patch text/x-patch 38.8 KB
v13-bmr-0002-Provide-API-for-streaming-relation-data.patch text/x-patch 33.2 KB
v13-bmr-0003-Use-streaming-I-O-in-pg_prewarm.patch text/x-patch 2.5 KB
v13-bmr-0004-Add-readbufferbench.patch text/x-patch 8.0 KB
v13-bmr-0005-Refactor-how-BufferManagerRelations-are-used.patch text/x-patch 52.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2024-03-28 20:45:17 Re: pg_upgrade --copy-file-range
Previous Message Bruce Momjian 2024-03-28 20:24:05 Re: Possibility to disable `ALTER SYSTEM`