Re: AIO / read stream heuristics adjustments for index prefetching

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Subject: Re: AIO / read stream heuristics adjustments for index prefetching
Date: 2026-04-04 00:06:32
Message-ID: CAAKRu_YDvFKd62JbvgavHzZke1Nbt5maH56aG+bNzozY7qvRvw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Apr 3, 2026 at 4:36 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Attached are a revised set of commits.

0001 aio: io_uring: Trigger async processing for large IOs
LGTM

0002 read_stream: Move logic about IO combining & issuing to helpers
LGTM

0003 read stream: Split decision about look ahead for AIO and combining

Stale comment reference to stream->distance around line 505

Typo above ReadStream->combine_distance: "The limits for read-ahead an
combining" should be "and combining".

typo in comment in read_stream_look_ahead()
"even if we are reached the read-ahead limit" I think you meant "even
if we have reached the read-ahead limit"

I didn't really notice this before
* doing full io_combine_limit sized reads.
*/
if (flags & READ_STREAM_FULL)
- stream->distance = Min(max_pinned_buffers, stream->io_combine_limit);
+ {
+ stream->readahead_distance = Min(max_pinned_buffers,
stream->io_combine_limit);
+ stream->combine_distance = Min(max_pinned_buffers,
stream->io_combine_limit);
+ }

that distance started out at io_combine_limit for READ_STREAM_FULL.
That doesn't make that much sense to me, wouldn't we set
readahead_distance to max_pinned_buffers() for READ_STREAM_FULL?

+ /*
+ * Allow looking further ahead if we are in the process of building a
+ * larger IO, the IO is not yet big enough and we don't yet have IO in
+ * flight. Note that this is allowed even if we are reached the
+ * read-ahead limit (but not the buffer pin limit, combine_distance is
+ * capped by it and we are checking for pinned_buffers == 0).
+ *
+ * The reason this is restricted to not yet having an IO in flight is that
+ * once we are actually reading ahead, we will not issue IOs before they
+ * have reached the full size (or can't be grown further). But we *have*
+ * to issue an IO once pinned_buffers == 0, otherwise there won't be a
+ * buffer to return to the caller.
+ *
+ * This is important for cases where either effective_io_concurrency is
+ * low or we never need to wait for IO and thus are not increasing the
+ * distance. Without this we would end up with lots of small IOs.
+ */
+ if (stream->pending_read_nblocks > 0 &&
+ stream->pinned_buffers == 0 &&
+ stream->pending_read_nblocks < stream->combine_distance)
+ return true;
+

I'd strongly recommend an oxford comma here for clarity:

"larger IO, the IO is not yet big enough and we don't yet have IO in" ->
"larger IO, the IO is not yet big enough, and we don't yet have IO in"

when you say "combine_distance is capped by it" -- what is it here? Do
you mean max_pinned_buffers? It is ambiguous like maybe you could mean
the read-ahead limit you just mentioned

Maybe rephrase that whole parenthetical like
"Note that this can exceed the read-ahead limit. It cannot exceed the
buffer pin limit because pinned_buffers == 0 and combine_distance is
capped by max_pinned_buffers."

And then the next paragraph also is confusing

* The reason this is restricted to not yet having an IO in flight is that
* once we are actually reading ahead, we will not issue IOs before they
* have reached the full size (or can't be grown further). But we *have*
* to issue an IO once pinned_buffers == 0, otherwise there won't be a
* buffer to return to the caller.

"restricted to not yet having an IO in flight is awkward"
Maybe just say which part you are referring to (pinned_buffers == 0)

My understanding about the pinned_buffers == 0 criteria, is that we
specifically want to perform the look-ahead even if we would exceed
readahead_distance when this is the first IO -- and we haven't
submitted other I/Os yet. Like when eic is low but we still want to do
read combining. But this comment makes it sound like it is a control
flow issue (e.g. that there "won't be a buffer to return to the
caller")

I asked AI to come up with something better, and this is what it said:

"The pinned_buffers == 0 restriction is important: if we already have
buffers to return to the caller, we can let the normal issue logic in
read_stream_should_issue_now() decide when to submit. But if we have
no pinned buffers, we must keep building the pending IO until it
reaches combine_distance (or can't grow further), because otherwise
we'd submit a small IO and have nothing to return while waiting for
more."

I'm not sure it's 100% correct, but maybe you can rework your version
of the paragraph a bit.

0004 read_stream: Only increase read-ahead distance when waiting for IO

As of this commit READ_STREAM_FULL will never actually get up to the
max_pinned_buffers if needed_wait is never true. That's probably fine,
but it is different than before. Anyway, LGTM.

0005 - Allow read_stream_reset() to not wait for IO completion

I can't comment on whether or not doing this in resowner cleanup is a
good architectural decision, as I don't know enough about that part of
postgres.

/* Stop looking ahead. */
- stream->readahead_distance = 0;
- stream->combine_distance = 0;
+ stream->readahead_distance = -1;
+ stream->combine_distance = -1;

I don't remember why we couldn't have a separate boolean field for
end-of-stream/stream-paused/stream-reset-in-progress? Did you ever
figure it out?

Separately, maybe we should forbid calling pgaio_wref_abandon() on IOs
that haven't been submitted yet. You can get a wref before the IO is
submitted. And I feel like weird stuff could happen if we let people
call pgaio_wref_abandon() on defined or staged IOs. For example,
resowner cleanup may take a very long time because of submitting and
waiting for a bunch of IOs. And IO handles are released before locks,
and I think we don't want to spend a long time waiting cleaning up IO
handles while holding locks.

Then I was thinking maybe you _do_ want to allow calling
pgaio_wref_abandon() on IOs that haven't been submitted because
pgaio_io_release() only works on PGAIO_HS_HANDED_OUT IOs and not on
defined or staged ones

If you don't forbid calling pgaio_wref_abandon on unsubmitted IOs,
then you'd probably want to restructure the code in
ResourceOwnerReleaseInternal() to loop through and submit any
unsubmitted IO first and then wait for IOs after.

pgaio_wref_abandon() does not invalidate the wait ref (which seems
fine to have the caller do it), but you should mention that and make
it clear that it doesn't do anything to the wait ref itself. This
isn't totally clear especially because your comment says

* Declare that a wait reference to an IO, started by this backend, is not of
* interest to this backend anymore.

really what it is doing is declaring that the caller is no longer
interested in receiving the IO result

and a weird future caller might want to abandon checking the result
and still wait for completion separately on the wait ref, so making it
sound like this function disables that isn't right either

In pgaio_wref_abandon(), it seems like you could get rid of this guard
if (ioh->report_return)
for clarity. in the context of pgaio_wref_abandon() it shouldn't ever
be NULL, right? Then it doesn't save you anything and implying it
could be NULL here is confusing

+ HOLD_INTERRUPTS();
+
+ /*
+ * It is safe to perform this check before checking if the IO was recycled
+ * as the owner of an IO cannot change.
+ */
+ if (!am_owner)
+ elog(ERROR, "only IOs owned by current backend can be abandoned");
+

no reason for the error to be after the HOLD_INTERRUPTS(), right? I
thought it made it harder to understand the comment about why we need
to hold interrupts here and if it can just be moved above, then that's
better

In AbandonReadBuffers() header comment:
"This needs to called" -> "This needs to be called"

I do find it a little weird that if someone calls AbandonReadBuffers()
and doesn't remember to release the buffer pin, it won't get
released...
and in our case, it doesn't just happen in the caller -- it happens in
the caller's caller

I wonder if we are going to want the option for someone to abandon all
the buffers in read_stream_pause()? Right now if you call
read_stream_reset() it overrides the resume_distance, so if you wanted
to restart the stream abandoning all existing IOs but keep up the
distance, you won't be able to do that. It will revert down to 1.

@@ -538,7 +539,7 @@ read_stream_should_issue_now(ReadStream *stream)
* If the callback has signaled end-of-stream, start the pending read
* immediately. There is no further potential for IO combining.
*/
- if (stream->readahead_distance == 0)
+ if (stream->readahead_distance <= 0)
return true;

I don't think you should get here when readahead_distance < 0. Maybe
we should assert that?

I didn't look at any of the test stuff yet.

- Melanie

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2026-04-04 00:14:49 Re: pg_plan_advice
Previous Message Heikki Linnakangas 2026-04-03 23:58:42 Re: Shared hash table allocations