Re: AIO / read stream heuristics adjustments for index prefetching

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Subject: Re: AIO / read stream heuristics adjustments for index prefetching
Date: 2026-04-02 14:31:50
Message-ID: CAAKRu_bfwBzg7=Zy88st6gBJf97Wkd3k=+m1ecApn=59SwmKSw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 31, 2026 at 12:02 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> 0005+0006: Only increase distance when waiting for IO
>
> Until now we have increased the read ahead distance whenever we
> needed to do IO (doubling the distance every miss). But that will often be
> way too aggressive, with the IO subsystem being able to keep up with a
> much lower distance.
>
> The idea here is to use information about whether we needed to wait for IO
> before returning the buffer in read_stream_next_buffer() to control
> whether we should increase the readahead distance.
>
> This seems to work extremely well for worker.
>
> Unfortunately with io_uring the situation is more complicated, because
> io_uring performs reads synchronously during submission if the data is in
> the kernel page cache. This can reduce performance substantially compared to
> worker, because it prevents parallelizing the copy from the page cache.
> There is an existing heuristic for that in method_io_uring.c that adds a
> flag to the IO submissions forcing the IO to be processed asynchronously,
> allowing for parallelism. Unfortunately the heuristic is triggered by the
> number of IOs in flight - which will never become big enough to trigger
> after using "needed to wait" to control how far to read ahead.
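
To make the two policies in the quoted text concrete, here is a minimal sketch. It is not the read stream code: DistanceState, grow_on_every_miss() and grow_only_if_waited() are hypothetical names, and the doubling-with-cap rule is just the simplest reading of "doubling the distance every miss".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the stream's look-ahead state. */
typedef struct DistanceState
{
	int			distance;		/* current look-ahead distance, in blocks */
	int			max_distance;	/* upper bound on the distance */
} DistanceState;

/* Old policy as described: double the distance on every miss, up to the cap. */
static void
grow_on_every_miss(DistanceState *s)
{
	assert(s->distance >= 1 && s->distance <= s->max_distance);

	if (s->distance > s->max_distance / 2)
		s->distance = s->max_distance;
	else
		s->distance *= 2;
}

/*
 * Policy under discussion: only grow when the consumer actually had to wait
 * for the IO before the buffer could be returned.  An IO that had already
 * completed by then leaves the distance unchanged.
 */
static void
grow_only_if_waited(DistanceState *s, bool needed_wait)
{
	if (needed_wait)
		grow_on_every_miss(s);
}
```

With this shape, a stream whose IOs always finish before the consumer asks for the buffer stays at its initial distance, which is exactly why an in-flight-count trigger would never fire.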

On some level, relying on worker mode overhead feels fragile. If
worker overhead decreases (say, by moving to IO worker threads), we
won't be able to rely on this to keep the distance at an advantageous
level.

If io_uring async copying is advantageous even when the consumer never
needs to wait, then it seems like parallelizing copying to/from the
kernel buffer cache will always be advantageous to do at some level.

The case where it is not (as you've stated before) is when the
consumer doesn't need the extra blocks, so it is just wasted time
spent acquiring them.

So, it feels odd to try and find a heuristic that allows the readahead
distance to increase even when the consumer is not having to wait. I'm
not saying we should do this for this release, but I'm just wondering
if in the medium term, we should try to find a better way to identify
the situation where async processing is not beneficial because the
blocks won't be needed.

> So 0005 expands the io_uring heuristic to also trigger based on the sizes
> of IOs - but that's decidedly not perfect, we e.g. have some experiments
> showing it regressing some parallel bitmap heap scan cases. It may be
> better to somehow tweak the logic to only trigger for worker.
>
> As is this has another issue, which is that it prevents IO combining in
> situations where it shouldn't, because right now the distance is used to
> control both. See 0008 for an attempt at splitting those concerns.

Yeah, I think running ahead far enough to get bigger IOs needs to
happen and can't be based on the consumer having to wait.
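
One way to picture that separation, purely as an illustration: keep the wait-driven distance and the combining limit as two values and look ahead by whichever is larger. LookaheadLimits and effective_lookahead() are hypothetical names, and this is an assumption about the general shape, not a description of what 0008 does.

```c
#include <assert.h>

/* Hypothetical split of the two concerns that share one distance today. */
typedef struct LookaheadLimits
{
	int			readahead_distance; /* grows only when the consumer waited */
	int			combine_limit;	/* blocks one combined IO may cover */
} LookaheadLimits;

/*
 * Look far enough ahead to build a full-size IO even when the consumer has
 * never had to wait, and further than that once waits push the distance up.
 */
static int
effective_lookahead(const LookaheadLimits *l)
{
	assert(l->readahead_distance >= 1 && l->combine_limit >= 1);

	return (l->readahead_distance > l->combine_limit)
		? l->readahead_distance
		: l->combine_limit;
}
```

With a distance of 1 and a combine limit of 16 this still looks 16 blocks ahead, so combining does not depend on the consumer having waited.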

- Melanie
