Re: AIO / read stream heuristics adjustments for index prefetching

From: Andres Freund <andres(at)anarazel(dot)de>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Subject: Re: AIO / read stream heuristics adjustments for index prefetching
Date: 2026-04-03 20:36:03
Message-ID: 24bjkmnkuapbs7wvcecvtrb3gvbrzg3extlkzpbg2f7dwt7h42@3e4vg6cd33iw
Lists: pgsql-hackers

Hi,

On 2026-04-03 15:04:51 -0400, Melanie Plageman wrote:
> On Fri, Apr 3, 2026 at 1:30 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > > - why not remove the combine_distance requirement from the fast path
> > > entry criteria (could save resume_combine_distance in the fast path
> > > and restore it after a miss)
> >
> > Because entering the fast path prevents IO combining from happening. So it's
> > absolutely crucial that we do *not* enter it while we would combine.
>
> But if it is a buffer hit, obviously we can't do IO combining anyway,
> or am I misunderstanding the fast path's common case?

It's true that we can't do combining in the fast path, but the problem is that
with eic=0/1 (or a recent history that leaves us with low distances or a low
pin limit), we will not start the next IO until there are no more buffers
pinned.

Imagine that we started one 16 block IO and have a readahead_distance of
1. After consuming 15 buffers, we will have one more buffer pinned, but
read_stream_look_ahead() will not yet start another IO, due to the
readahead_distance condition (or max_pinned_buffers or ...). Without the
stream->combine_distance == 1 check, the subsequent check in
read_stream_next_buffer() would consider this a valid case for entering
the fast path.

> > > You mentioned that we don't want to read too far ahead (including for
> > > a single combined IO) in part because:
> > >
> > > > The resowner and private refcount mechanisms take more CPU cycles if you have
> > > > more buffers pinned
> > >
> > > But I don't see how either distance is responding to this or
> > > self-calibrating with this in mind
> >
> > Using the minimal required distance to avoid needing to wait for IO completion
> > is responding to that, no? Without these patches we read ahead as far as
> > possible, even if all the data is in the page cache, which makes this issue
> > way worse (without these patches it's a major source of regressions in the
> > index prefetching patch).
>
> But we aren't using the minimal distance to avoid needing to wait for
> IO completion. We are also using a higher distance to try and get IO
> combining and toallow for async copying into the kernel buffer cache,
> etc, etc.

My testing suggests that doing IO combining for a reasonable io_combine_limit
is pretty much always a win in a steady-state stream (i.e. not a short one
that's not fully consumed); the gain from avoiding the larger number of
syscalls is sufficiently large.

Once we start doing async copying from the kernel page cache, we will have to
wait for the completion of that async work, which will lead to
readahead_distance being increased if necessary.

> There's a lot of different considerations; it isn't just two opposing
> forces.

It's not, but I think always performing io_combine_limit sized IOs after a
ramp-up and increasing the distance based on needing to wait is a pretty
decent heuristic.

For best results it does require pgaio_uring_should_use_async() to trigger, as
otherwise we do not get the parallelized memory copy. Which means it may
never trigger if we don't occasionally reach the size-based condition. Luckily
it does not seem like using async is beneficial for small IOs.

> And, I'd imagine that the relationship between the
> number of buffers pinned and CPU cycles required for resowner/refcount
> isn't perfectly linear.

It's definitely not.

> I'm not saying that we don't do IO combining at high distances, I'm
> more saying that it is confusing that combine_distance controls how
> far we look ahead when readahead_distance is low but when
> readahead_distance is high, it controls when we issue the IO and not
> how far we look ahead. I don't think we should change course now, but
> I wanted to call out that this felt a little uncomfortable to me.

I'm not sure I see an alternative. I tried to at least improve the comments
around this.

Attached are a revised set of commits. The largest changes are:

- Reordered the series to put
"read_stream: Only increase read-ahead distance when waiting for IO"
after
"stream: Split decision about look ahead for AIO and combining"

Previously I thought it'd be too awkward from a comment perspective, but
there's only one comment where it is a bit odd.

Think it's much clearer this way.

- Largely rewrote "Hacky implementation of making read_stream_reset()/end()
not wait for IO". Looks a lot saner now.

Think this needs a few more tests, in particular for the read stream and
foreign_io paths. Will do that in the next version.

- Tried to address most of Bilal's and Melanie's feedback

- Removed some redundant checks from read_stream_should_issue_now()

- Lots of comment polishing, including revising the top-level read_stream.c
comment

Greetings,

Andres Freund

Attachment Content-Type Size
v5-0001-aio-io_uring-Trigger-async-processing-for-large-I.patch text/x-diff 7.4 KB
v5-0002-read_stream-Move-logic-about-IO-combining-issuing.patch text/x-diff 5.0 KB
v5-0003-read-stream-Split-decision-about-look-ahead-for-A.patch text/x-diff 12.9 KB
v5-0004-read_stream-Only-increase-read-ahead-distance-whe.patch text/x-diff 8.4 KB
v5-0005-Allow-read_stream_reset-to-not-wait-for-IO-comple.patch text/x-diff 20.4 KB
