Re: AIO / read stream heuristics adjustments for index prefetching

From: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: Re: AIO / read stream heuristics adjustments for index prefetching
Date: 2026-04-02 12:30:26
Message-ID: CAN55FZ3GT+o545U9JLR0AC5JtBznqJP4Mf9Mi4k3axqqAXTxPg@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Thu, 2 Apr 2026 at 02:20, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> I've pushed what was 0001 and 0002. Will push the former 0003 shortly.

0001 LGTM.

0002: read_stream: Prevent distance from decaying too quickly

+ /*
+ * As we needed IO, prevent distance from being reduced within our
+ * maximum look-ahead window. This avoids having distance collapse too
+ * quickly in workloads where most of the required blocks are cached,
+ * but where the remaining IOs are a large enough factor to cause
+ * a substantial slowdown if executed synchronously.
+ *
+ * There are valid arguments for preventing decay for max_ios or for
+ * max_pinned_buffers. But the argument for max_pinned_buffers seems
+ * clearer - if we can't see any misses within the maximum look-ahead
+ * distance, we can't do any useful read-ahead.
+ */
+ stream->distance_decay_holdoff = stream->max_pinned_buffers;

That is already committed, but I have a question: did you think about
setting stream->distance_decay_holdoff to the current stream->distance
instead? That would also fix the 'miss followed by a hit' case (it
won't fix a double miss followed by a hit, though).

0003: aio: io_uring: Trigger async processing for large IOs

There is a small optimization opportunity: we don't need to calculate
io_size for direct IO, since pgaio_uring_should_use_async() will always
return false in that case. Do you think it is worth implementing? Other
than that, LGTM.

0004: read_stream: Only increase distance when waiting for IO

LGTM.

0005: WIP: read_stream: Move logic about IO combining & issuing to helpers

/* never pin more buffers than allowed */
if (stream->pinned_buffers + stream->pending_read_nblocks >=
stream->max_pinned_buffers)
return false;

/*
* Don't start more read-ahead if that'd put us over the distance limit
* for doing read-ahead.
*/
if (stream->pinned_buffers + stream->pending_read_nblocks >=
stream->distance)
return false;

AFAIK, stream->distance can't be higher than
stream->max_pinned_buffers [1], so I think we can remove the first if
block. If we are not sure about [1], then since
stream->max_pinned_buffers will most likely be higher than
stream->distance, it would make sense to swap the order of these
checks.

Aha, I understood this change after looking at 0006. It is a nitpick,
but the 'if (stream->pinned_buffers + stream->pending_read_nblocks >=
stream->max_pinned_buffers)' block could be added in 0006 instead.
Other than these, LGTM.

0006: WIP: read stream: Split decision about look ahead for AIO and combining

I like the idea of being more aggressive about IO combining. What is
the reason for gradually increasing combine_distance? Is it to avoid
doing unnecessary IOs at the start?

+ /*
+ * XXX: Should we actually reduce this at any time other than
+ * a reset? For now we have to, as this is also a condition
+ * for re-enabling fast_path.
+ */
+ if (stream->combine_distance > 1)
+ stream->combine_distance--;

I don't think we need to reduce this other than at reset.

+ /*
+ * Allow looking further ahead if we are in the process of building a
+ * larger IO, the IO is not yet big enough and we don't yet have IO in
+ * flight. Note that this is allowed even if we are reaching the
+ * read-ahead limit (but not the buffer pin limit).
+ *
+ * This is important for cases where either effective_io_concurrency is
+ * low or we never need to wait for IO and thus are not increasing the
+ * distance. Without this we would end up with lots of small IOs.
+ */
+ if (stream->pending_read_nblocks > 0 &&
+ stream->pinned_buffers == 0 &&
+ stream->pending_read_nblocks < stream->combine_distance)
+ return true;

Do we need to check 'stream->pinned_buffers == 0' here? I think we can
continue to look ahead even if there are pinned buffers, since we
already know 'stream->pinned_buffers + stream->pending_read_nblocks <
stream->max_pinned_buffers'.

+ /* same if capped not by io_combine_limit but combine_distance */
+ if (stream->combine_distance > 0 &&
+ pending_read_nblocks >= stream->combine_distance)
+ return true;

I don't think we need to check 'stream->combine_distance > 0'; it
shouldn't be less than or equal to zero when
'stream->readahead_distance' is nonzero, right?

+ {
+ stream->readahead_distance = Min(max_pinned_buffers,
+ stream->io_combine_limit);
+ stream->combine_distance = stream->io_combine_limit;
+ }

I think stream->combine_distance should be set equal to
stream->readahead_distance, as we don't want to surpass
max_pinned_buffers.

0007: Hacky implementation of making read_stream_reset()/end() not wait for IO

Read stream parts LGTM. I don't have a comment on the FIXME part, though.

--
Regards,
Nazir Bilal Yavuz
Microsoft
