Re: Streaming I/O, vectored I/O (WIP)

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Streaming I/O, vectored I/O (WIP)
Date: 2024-03-26 12:40:03
Message-ID: 289a1c0e-8444-4009-a8c2-c2d77ced6f07@iki.fi
Lists: pgsql-hackers

On 24/03/2024 15:02, Thomas Munro wrote:
> On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> Maybe 'pg_streaming_read_next_buffer' or just 'pg_streaming_read_next',
>> for a shorter name.
>
> Hmm. The idea of 'buffer' appearing in a couple of names is that
> there are conceptually other kinds of I/O that we might want to
> stream, like raw files or buffers other than the buffer pool, maybe
> even sockets, so this would be part of a family of similar interfaces.
> I think it needs to be clear that this variant gives you buffers. I'm
> OK with removing "get" but I guess it would be better to keep the
> words in the same order across the three functions? What about these?
>
> streaming_read_buffer_begin();
> streaming_read_buffer_next();
> streaming_read_buffer_end();
>
> Tried like that in this version. Other ideas would be to make
> "stream" the main noun, buffered_read_stream_begin() or something.
> Ideas welcome.

Works for me, although "streaming_read_buffer" is a pretty long prefix.
The flags like "STREAMING_READ_MAINTENANCE" probably ought to be
"STREAMING_READ_BUFFER_MAINTENANCE" as well.

Maybe "buffer_stream_next()"?

> Here are some other changes:
>
> * I'm fairly happy with the ABC adaptive distance algorithm so far, I
> think, but I spent more time tidying up the way it is implemented. I
> didn't like the way each 'range' had buffer[MAX_BUFFERS_PER_TRANSFER],
> so I created a new dense array stream->buffers that behaved as a
> second circular queue.
>
> * The above also made it trivial for MAX_BUFFERS_PER_TRANSFER to
> become the GUC that it always wanted to be: buffer_io_size defaulting
> to 128kB. Seems like a reasonable thing to have? Could also
> influence things like bulk write? (The main problem I have with the
> GUC currently is choosing a category, async resources is wrong....)
>
> * By analogy, it started to look a bit funny that each range had room
> for a ReadBuffersOperation, and we had enough ranges for
> max_pinned_buffers * 1 block range. So I booted that out to another
> dense array, of size max_ios.
>
> * At the same time, Bilal and Andres had been complaining privately
> about 'range' management overheads showing up in perf and creating a
> regression against master on fully cached scans that do nothing else
> (eg pg_prewarm, where we lookup, pin, unpin every page and do no I/O
> and no CPU work with the page, a somewhat extreme case but a
> reasonable way to isolate the management costs); having made the above
> change, it suddenly seemed obvious that I should make the buffers
> array the 'main' circular queue, pointing off to another place for
> information required for dealing with misses. In this version, there
> are no more range objects. This feels better and occupies and touches
> less memory. See pictures below.

+1 for all that. Much better!

> * Various indexes and sizes that couldn't quite fit in uint8_t but
> couldn't possibly exceed a few thousand because they are bounded by
> numbers deriving from range-limited GUCs are now int16_t (while I was
> looking for low hanging opportunities to reduce memory usage...)

Is int16 enough though? It seems so, because:

max_pinned_buffers = Max(max_ios * 4, buffer_io_size);

and max_ios is constrained by the GUC's maximum MAX_IO_CONCURRENCY, and
buffer_io_size is constrained by MAX_BUFFER_IO_SIZE == PG_IOV_MAX == 32.

If someone changes those constants, though, int16 could overflow and fail
in weird ways. I'd suggest being more careful here: explicitly clamp
max_pinned_buffers at PG_INT16_MAX, or add a static assertion or
something. (I think it actually needs to be somewhat less than
PG_INT16_MAX, because of the extra "overflow buffers" stuff and some
other places where you do arithmetic.)
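
For example, something along these lines (just a sketch; where exactly
the clamp lives and how much headroom it needs is up to you):

    /* In streaming_read_buffer_begin(), or wherever the sizing happens. */
    max_pinned_buffers = Max(max_ios * 4, buffer_io_size);

    /* Leave some headroom below PG_INT16_MAX for the overflow buffers. */
    max_pinned_buffers = Min(max_pinned_buffers, PG_INT16_MAX / 2);

    /* And/or catch a future change of the constants at compile time. */
    StaticAssertStmt(MAX_IO_CONCURRENCY * 4 < PG_INT16_MAX &&
                     MAX_BUFFER_IO_SIZE < PG_INT16_MAX,
                     "max_pinned_buffers must fit in int16");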

> /*
>  * We gave a contiguous range of buffer space to StartReadBuffers(), but
>  * we want it to wrap around at max_pinned_buffers. Move values that
>  * overflowed into the extra space. At the same time, put -1 in the I/O
>  * slots for the rest of the buffers to indicate no I/O. They are covered
>  * by the head buffer's I/O, if there is one. We avoid a % operator.
>  */
> overflow = (stream->next_buffer_index + nblocks) - stream->max_pinned_buffers;
> if (overflow > 0)
> {
>     memmove(&stream->buffers[0],
>             &stream->buffers[stream->max_pinned_buffers],
>             sizeof(stream->buffers[0]) * overflow);
>     for (int i = 0; i < overflow; ++i)
>         stream->buffer_io_indexes[i] = -1;
>     for (int i = 1; i < nblocks - overflow; ++i)
>         stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
> }
> else
> {
>     for (int i = 1; i < nblocks; ++i)
>         stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
> }

Instead of clearing buffer_io_indexes here, it might be cheaper/simpler
to initialize the array to -1 in streaming_read_buffer_begin(), and
reset buffer_io_indexes[io_index] = -1 in streaming_read_buffer_next(),
after the WaitReadBuffers() call. In other words, except when an I/O is
in progress, keep all the elements at -1, even the elements that are not
currently in use.
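
Roughly like this (untested sketch, and I'm guessing at the surrounding
variable names):

    /* In streaming_read_buffer_begin(): everything starts out as "no I/O". */
    for (int i = 0; i < stream->max_pinned_buffers; i++)
        stream->buffer_io_indexes[i] = -1;

    /*
     * In streaming_read_buffer_next(): once the I/O covering the buffer we
     * are about to return has completed, put its slot back to -1.
     * ("buffer_index" stands for whatever variable holds the index of the
     * buffer being returned.)
     */
    int16 io_index = stream->buffer_io_indexes[buffer_index];

    if (io_index >= 0)
    {
        WaitReadBuffers(&stream->ios[io_index]);
        stream->buffer_io_indexes[buffer_index] = -1;
    }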

Alternatively, you could remember the first buffer that the I/O applies
to in the 'ios' array. In other words, instead of pointing from buffer
to the I/O that it depends on, point from the I/O to the buffer that
depends on it. The last attached patch implements that approach. I'm not
wedded to it, but it feels a little simpler.
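
In schematic form (the struct and field names here are made up; the
attached 0009 patch has the real version):

    /* Each in-flight I/O remembers which buffer queue entry it starts at. */
    typedef struct StreamingReadIO
    {
        ReadBuffersOperation op;
        int16       buffer_index;   /* first buffer covered by this I/O */
    } StreamingReadIO;

    /* When starting an I/O for the blocks at next_buffer_index: */
    stream->ios[io_index].buffer_index = stream->next_buffer_index;

    /*
     * When handing out a buffer: if the oldest not-yet-completed I/O starts
     * at this buffer, wait for it first. No per-buffer io index needed.
     */
    if (stream->ios[oldest_io_index].buffer_index == buffer_index)
        WaitReadBuffers(&stream->ios[oldest_io_index].op);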

> if (stream->ios[io_index].flags & READ_BUFFERS_ISSUE_ADVICE)
> {
>     /* Distance ramps up fast (behavior C). */
>     ...
> }
> else
> {
>     /* No advice; move towards full I/O size (behavior B). */
>     ...
> }

The comment on ReadBuffersOperation says "Declared in public header only
to allow inclusion in other structs, but contents should not be
accessed", but here you access the 'flags' field.

You also mentioned that the StartReadBuffers() argument list is too
long. Perhaps the solution is to redefine ReadBuffersOperation so that
it consists of two parts: the first part is filled in by the caller and
contains the arguments, and the second part is private to bufmgr.c. The
signature for StartReadBuffers() would then be just:

bool StartReadBuffers(ReadBuffersOperation *operation);

That would make it OK to read the 'flags' field. It would also allow
reusing the same ReadBuffersOperation struct for multiple I/Os on the
same relation; you would only need to update the fields that change from
one operation to the next.
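
To make the shape concrete, something roughly like this (the exact set
of caller-supplied fields is only indicative; it would be whatever
StartReadBuffers() currently takes as arguments):

    typedef struct ReadBuffersOperation
    {
        /* Caller-supplied parameters, i.e. the current argument list. */
        Relation    rel;
        ForkNumber  forknum;
        BufferAccessStrategy strategy;
        Buffer     *buffers;
        BlockNumber blocknum;
        int         nblocks;
        int         flags;

        /* State private to bufmgr.c, filled in by StartReadBuffers(). */
        /* ... */
    } ReadBuffersOperation;

    bool StartReadBuffers(ReadBuffersOperation *operation);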

In the attached patch set, the first three patches are your v9 with no
changes. The last patch refactors away 'buffer_io_indexes' like I
mentioned above. The others are fixes for some other trivial things that
caught my eye.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachment Content-Type Size
v9.heikki-0001-Provide-vectored-variant-of-ReadBuffer.patch text/x-patch 37.5 KB
v9.heikki-0002-Provide-API-for-streaming-reads-of-relatio.patch text/x-patch 29.3 KB
v9.heikki-0003-Use-streaming-reads-in-pg_prewarm.patch text/x-patch 2.6 KB
v9.heikki-0004-Cleanup-boilerplate-file-header-comments.patch text/x-patch 2.5 KB
v9.heikki-0005-Tidy-up-includes.patch text/x-patch 1.3 KB
v9.heikki-0006-Mention-that-STREAMING_READ_MAINTENANCE-is.patch text/x-patch 1.0 KB
v9.heikki-0007-Trivial-comment-fixes.patch text/x-patch 1.3 KB
v9.heikki-0008-Use-wipe_mem-for-callback-data-that-should.patch text/x-patch 1.5 KB
v9.heikki-0009-Replace-buffer_io_indexes-array-with-ref-f.patch text/x-patch 9.2 KB
