From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Subject: Re: AIO / read stream heuristics adjustments for index prefetching
Date: 2026-04-03 19:04:51
Message-ID: CAAKRu_b24CSa1ja8yEu+5PnTpWcpL0EFdRL1TwQye7VffvYjJA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Apr 3, 2026 at 1:30 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> > - why not remove the combine_distance requirement from the fast path
> > entry criteria (could save resume_combine_distance in the fast path
> > and restore it after a miss)
>
> Because entering the fast path prevents IO combining from happening. So it's
> absolutely crucial that we do *not* enter it while we would combine.

But if it is a buffer hit, obviously we can't do IO combining anyway,
or am I misunderstanding the fast path's common case?

> > > In my experiments the batching was primarily useful to allow to reduce the
> > > command submission & interrupt overhead when interacting with storage, i.e.
> > > when actually doing IO, not just copying from the page cache.
> >
> > It still seems to me that batching would help guarantee parallel
> > memcpys for workers when data is in the kernel buffer cache because if
> > the IO is quick, the same worker may pick up the next IO.
>
> When you say workers, do you mean for io_method=worker? Or the internal kernel
> threads of io_uring for async IOs?

I'm talking about io_method worker.

> > You mentioned that we don't want to read too far ahead (including for
> > a single combined IO) in part because:
> >
> > > The resowner and private refcount mechanisms take more CPU cycles if you have
> > > more buffers pinned
> >
> > But I don't see how either distance is responding to this or
> > self-calibrating with this in mind
>
> Using the minimal required distance to avoid needing to wait for IO completion
> is responding to that, no? Without these patches we read ahead as far as
> possible, even if all the data is in the page cache, which makes this issue
> way worse (without these patches it's a major source of regressions in the
> index prefetching patch).

But we aren't using the minimal distance to avoid needing to wait for
IO completion. We are also using a higher distance to try and get IO
combining and to allow for async copying into the kernel buffer cache,
etc. There are a lot of different considerations; it isn't just two
opposing forces. And, I'd imagine that the relationship between the
number of buffers pinned and CPU cycles required for resowner/refcount
isn't perfectly linear.

> > I think it is weird to have combine_distance only be relevant when
> > readahead_distance is low. You said:
> >
> > > We could, but I don't think there would be a benefit in doing so. In my mind,
> > > what combine_distance is used for, is to control how far to look ahead to
> > > allow for IO combining when the readahead_distance is too low to allow for IO
> > > combining. But if pinned_buffers > 0, we already have another IO in flight,
> > > so the combine_distance mechanism doesn't have to kick in.
> >
> > But it seems like for a completely uncached workload bigger IOs is
> > still beneficial.
>
> Massively so - the storage layer getting too small IOs really hurts.
>
> But, as the code stands, we *do* end up with large IOs in that case, because
> we will not issue the IO until it is "complete". If we need to do actual IO,
> the readahead_distance will be larger, and allow multiple full sized IOs to be
> issued.
>
> /*
>  * We don't start the pending read just because we've hit the distance limit,
>  * preferring to give it another chance to grow to full io_combine_limit size
>  * once more buffers have been consumed. But this is not desirable in all
>  * situations - see below.
>  */
> static inline bool
> read_stream_should_issue_now(ReadStream *stream)
> {
> ...
>     /*
>      * If we've already reached io_combine_limit, there's no chance of growing
>      * the read further.
>      */
>     if (pending_read_nblocks >= stream->io_combine_limit)
>         return true;
>
>     /* same if capped not by io_combine_limit but combine_distance */
>     if (stream->combine_distance > 0 &&
>         pending_read_nblocks >= stream->combine_distance)
>         return true;
>
>     /*
>      * If we currently have no reads in flight or prepared, issue the IO once
>      * we are not looking ahead further. This ensures there's always at least
>      * one IO prepared.
>      */
>     if (stream->pinned_buffers == 0 &&
>         !read_stream_should_look_ahead(stream))
>         return true;
>
>     return false;
> }
>
> So, unless we are not reading ahead, we are not issuing IOs until they are
> fully sized (or can't be combined, obviously).
>
> Am I misunderstanding what you're talking about?

I'm not saying that we don't do IO combining at high distances, I'm
more saying that it is confusing that combine_distance controls how
far we look ahead when readahead_distance is low but when
readahead_distance is high, it controls when we issue the IO and not
how far we look ahead. I don't think we should change course now, but
I wanted to call out that this felt a little uncomfortable to me.

- Melanie
