Re: AIO / read stream heuristics adjustments for index prefetching

From: Andres Freund <andres(at)anarazel(dot)de>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Subject: Re: AIO / read stream heuristics adjustments for index prefetching
Date: 2026-04-04 01:24:55
Message-ID: d3dretyqf43c45ajf5ul7n2z77kdj4ka2bq5hey2l5grx7i74y@tvvoqch2uc7a
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2026-04-03 20:06:32 -0400, Melanie Plageman wrote:
> On Fri, Apr 3, 2026 at 4:36 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> typo in comment in read_stream_look_ahead()
> "even if we are reached the read-ahead limit" I think you meant "even
> if we have reached the read-ahead limit"
>
> I didn't really notice this before
> * doing full io_combine_limit sized reads.
> */
> if (flags & READ_STREAM_FULL)
> - stream->distance = Min(max_pinned_buffers, stream->io_combine_limit);
> + {
> + stream->readahead_distance = Min(max_pinned_buffers,
> stream->io_combine_limit);
> + stream->combine_distance = Min(max_pinned_buffers,
> stream->io_combine_limit);
> + }
>
> that distance started out at io_combine_limit for READ_STREAM_FULL.
> That doesn't make that much sense to me, wouldn't we set
> readahead_distance to max_pinned_buffers() for READ_STREAM_FULL?

Yea, I agree, it doesn't seem quite right to me. I raised that a while ago
too. But it's something nontrivial that I think should be changed separately.

> + /*
> + * Allow looking further ahead if we are in the process of building a
> + * larger IO, the IO is not yet big enough and we don't yet have IO in
> + * flight. Note that this is allowed even if we are reached the
> + * read-ahead limit (but not the buffer pin limit, combine_distance is
> + * capped by it and we are checking for pinned_buffers == 0).
> + *
> + * The reason this is restricted to not yet having an IO in flight is that
> + * once we are actually reading ahead, we will not issue IOs before they
> + * have reached the full size (or can't be grown further). But we *have*
> + * to issue an IO once pinned_buffers == 0, otherwise there won't be a
> + * buffer to return to the caller.
> + *
> + * This is important for cases where either effective_io_concurrency is
> + * low or we never need to wait for IO and thus are not increasing the
> + * distance. Without this we would end up with lots of small IOs.
> + */
> + if (stream->pending_read_nblocks > 0 &&
> + stream->pinned_buffers == 0 &&
> + stream->pending_read_nblocks < stream->combine_distance)
> + return true;
> +
>
> I'd strongly recommend an oxford comma here for clarity:
>
> "larger IO, the IO is not yet big enough and we don't yet have IO in" ->
> "larger IO, the IO is not yet big enough, and we don't yet have IO in"

Done.

> when you say "combine_distance is capped by it" -- what is it here? Do
> you mean max_pinned_buffers? It is ambiguous like maybe you could mean
> the read-ahead limit you just mentioned

max_pinned_buffers, yes.

> Maybe rephrase that whole parenthetical like
> "Note that this can exceed the read-ahead limit. It cannot exceed the
> buffer pin limit because pinned_buffers == 0 and combine_distance is
> capped by max_pinned_buffers."

> And then the next paragraph also is confusing
>
> * The reason this is restricted to not yet having an IO in flight is that
> * once we are actually reading ahead, we will not issue IOs before they
> * have reached the full size (or can't be grown further). But we *have*
> * to issue an IO once pinned_buffers == 0, otherwise there won't be a
> * buffer to return to the caller.
>
> "restricted to not yet having an IO in flight is awkward"
> Maybe just say which part you are referring to (pinned_buffers == 0)

(I rewrote the entire comment to make the goal clear, based on your question
below)

> My understanding about the pinned_buffers == 0 criteria, is that we
> specifically want to perform the look-ahead even if we would exceed
> readahead_distance when this is the first IO -- and we haven't
> submitted other I/Os yet.

This is much more common than just the first IO. If we end up not needing to
wait for IO (e.g. because the scan does something CPU intensive with the
returned pages, or because io_method=io_uring executes IO synchronously, or
...), readahead_distance just stays at 1. But we do want to do IO combining
to the full extent, and may do so, because combine_distance ==
io_combine_limit after a few IOs.

If readahead_distance == 1, read_tream_next_buffer()->read_stream_look_ahead()
will not start another IO until pinned_buffers == 0. Without the
combine_distance logic, we would then issue an IO of exactly size 1. But with
the combine_distance logic, we will at that point build a combine_distance
sized IO.

We should not do this once pinned_buffers > 0:

- It's not needed, because we won't issue the pending IO until it is big
enough, as long as pinned_buffers > 0. That may mean that IO concurrency is
"too low", but if so, we will notice that due to WaitReadBuffers() returning
true, leading to an increased readahead_distance.

- It'd lead to reading further ahead than there's a need to. If we e.g. have a
readahead_distance of 16, and already have 15 buffers pinned,
read_stream_should_look_ahead() will start a pending read, but (with the
code in the patch or in master) won't actually issue the read, until the
caller has consumed more of the buffers (reducing max_pinned_buffers,
allowing to increase the size of the pending read). We'd do this until we
either grew the IO to the max size or the next block can't be combined.

What's the argument for issuing eagerly in this case?

> 0005 - Allow read_stream_reset() to not wait for IO completion
>
> I can't comment on whether or not doing this in resowner cleanup is a
> good architectural decision, as I don't know enough about that part of
> postgres.

I think I'm pretty OK with that part now, after chewing on it for a good
while.

We already did things like TerminateBufferIO() as part of resowner cleanup
before AIO (c.f. ResOwnerReleaseBufferIO()->AbortBufferIO()). Waiting for
in-flight IOs is basically the equivalent thing in an AIO world.

> /* Stop looking ahead. */
> - stream->readahead_distance = 0;
> - stream->combine_distance = 0;
> + stream->readahead_distance = -1;
> + stream->combine_distance = -1;
>
> I don't remember why we couldn't have a separate boolean field for
> end-of-stream/stream-paused/stream-reset-in-progress? Did you ever
> figure it out?

I don't think there really is a good reason, but it's a not entirely trivial
change.

> Separately, maybe we should forbid calling pgaio_wref_abandon() on IOs
> that haven't been submitted yet.

Yea, that should probably assert out. I can't see any reason for that to be
needed.

> And I feel like weird stuff could happen if we let people call
> pgaio_wref_abandon() on defined or staged IOs. For example, resowner cleanup
> may take a very long time because of submitting and waiting for a bunch of
> IOs. And IO handles are released before locks, and I think we don't want to
> spend a long time waiting cleaning up IO handles while holding locks.

I don't find that a concern with the current users of AIO - outside of error
paths we simply never have IOs that live beyond the resowner.

I think the argument for holding a lock is actually a *pro* argument for
waiting during resowner cleanup: That way the AIO subsystem will not hold onto
buffer pins for relations that the issuing bckend does not hold a lock for.

> Then I was thinking maybe you _do_ want to allow calling
> pgaio_wref_abandon() on IOs that haven't been submitted because
> pgaio_io_release() only works on PGAIO_HS_HANDED_OUT IOs and not on
> defined or staged ones

I don't know what the potential use case for that would be. Why would you
stage an IO that you then immediately (after all you're not allowed to do very
much when in batch mode) want to abandon. If that's needed, just submit the
IO.

> pgaio_wref_abandon() does not invalidate the wait ref (which seems
> fine to have the caller do it), but you should mention that and make
> it clear that it doesn't do anything to the wait ref itself.

Hm. We don't state that either for pgaio_wref_wait().

I wonder if we should make the iow argument for pgaio_wref_wait(),
pgaio_wref_check_done(), pgaio_wref_abandon(), pgaio_wref_valid(),
pgaio_wref_get_id() const to make that clear on a type-system level?

> This isn't totally clear especially because your comment says

> * Declare that a wait reference to an IO, started by this backend, is not of
> * interest to this backend anymore.
>
> really what it is doing is declaring that the caller is no longer
> interested in receiving the IO result

Do you think we'd need a comment in addition to the const?

> and a weird future caller might want to abandon checking the result
> and still wait for completion separately on the wait ref, so making it
> sound like this function disables that isn't right either

It's not even that crazy to do that and it's kinda done today: Even after
abandoning the IO as part of a read stream reset, we might need to wait for
the buffer to be ready as part of a different ReadBuffersOperation().

> In pgaio_wref_abandon(), it seems like you could get rid of this guard
> if (ioh->report_return)
> for clarity. in the context of pgaio_wref_abandon() it shouldn't ever
> be NULL, right? Then it doesn't save you anything and implying it
> could be NULL here is confusing

It can be NULL here. It's legit to simply pass ret=NULL to
pgaio_io_acquire_nb().

But of course that doesn't mean we can't unconditionally set it to NULL. I'm
slightly hesitant to do that just in pgaio_wref_abandon(), as
pgaio_io_release_resowner() already has the NULL check before this commit.

> + HOLD_INTERRUPTS();
> +
> + /*
> + * It is safe to perform this check before checking if the IO was recycled
> + * as the owner of an IO cannot change.
> + */
> + if (!am_owner)
> + elog(ERROR, "only IOs owned by current backend can be abandoned");
> +
>
> no reason for the error to be after the HOLD_INTERRUPTS(), right?

Not that I can see.

> I thought it made it harder to understand the comment about why we need to
> hold interrupts here and if it can just be moved above, then that's better

K.

> In AbandonReadBuffers() header comment:
> "This needs to called" -> "This needs to be called"
>
> I do find it a little weird that if someone calls AbandonReadBuffers()
> and doesn't remember to release the buffer pin, it won't get
> released...

You do get a warning about it if you do forget (about buffer pins not being
released at the end of the statement). I don't think it'd make any code
simpler if we made AbandonReadBuffers() release the pins, as read_stream.c
already needs to think about the buffer pins it holds for buffers it did not
abandon (e.g. because no IO was necessary). I think the control flow to
transport that upwards whether we need to release pins or not would end up
quite awkward.

> I wonder if we are going to want the option for someone to abandon all
> the buffers in read_stream_pause()?

I can't see what that would be good for? The buffers that would be returned
by the read stream wouldn't necessarily be BM_VALID, and I don't think code
outside of low level stuff like bufmgr.c and read_stream.c should have to deal
with !BM_VALID buffers.

> Right now if you call read_stream_reset() it overrides the resume_distance,
> so if you wanted to restart the stream abandoning all existing IOs but keep
> up the distance, you won't be able to do that. It will revert down to 1.

I think if that's an issue we'd want to have a reset variant that keeps the
distance, not make pause extremely hard to use...

> @@ -538,7 +539,7 @@ read_stream_should_issue_now(ReadStream *stream)
> * If the callback has signaled end-of-stream, start the pending read
> * immediately. There is no further potential for IO combining.
> */
> - if (stream->readahead_distance == 0)
> + if (stream->readahead_distance <= 0)
> return true;
>
> I don't think you should get here when readahead_distance < 0. Maybe
> we should assert that?

I think we do right now, if there is an incomplete pending read. It looks not
entirely trivial to change that...

Thanks for the review!

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Lukas Fittl 2026-04-04 02:09:00 Re: Add custom EXPLAIN options support to auto_explain
Previous Message Masahiko Sawada 2026-04-04 01:11:58 Re: POC: Parallel processing of indexes in autovacuum