Re: AIO / read stream heuristics adjustments for index prefetching

From: Andres Freund <andres(at)anarazel(dot)de>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Subject: Re: AIO / read stream heuristics adjustments for index prefetching
Date: 2026-04-02 15:47:39
Message-ID: pj4kgtdrevvkfbmlri6p27belctxru7ytyprcb6v74c7zbh3l6@m4dcu2rljedv
Lists: pgsql-hackers

Hi,

On 2026-04-02 10:31:50 -0400, Melanie Plageman wrote:
> On Tue, Mar 31, 2026 at 12:02 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > 0005+0006: Only increase distance when waiting for IO
> >
> > Until now we have increased the read-ahead distance whenever we needed
> > to do IO (doubling the distance on every miss). But that is often far
> > too aggressive, since the IO subsystem can keep up with a much lower
> > distance.
> >
> > The idea here is to use information about whether we needed to wait for IO
> > before returning the buffer in read_stream_next_buffer() to control
> > whether we should increase the readahead distance.
> >
> > This seems to work extremely well for worker.
> >
> > Unfortunately with io_uring the situation is more complicated, because
> > io_uring performs reads synchronously during submission if the data is in
> > the kernel page cache. This can reduce performance substantially compared
> > to worker, because it prevents parallelizing the copy from the page cache.
> > There is an existing heuristic for that in method_io_uring.c that adds a
> > flag to the IO submissions forcing the IO to be processed asynchronously,
> > allowing for parallelism. Unfortunately the heuristic is triggered by the
> > number of IOs in flight - which will never grow big enough to trigger
> > once "needed to wait" controls how far to read ahead.
>
> On some level, relying on worker mode overhead feels fragile. If
> worker overhead decreases—say, by moving to IO worker threads—we won't
> be able to rely on this to keep the distance to an advantageous level.

I don't see why lower overhead would prevent this from working?

> If io_uring async copying is advantageous even when the consumer never
> needs to wait, then it seems like parallelizing copying to/from the
> kernel buffer cache will always be advantageous to do at some level.

It's not universally advantageous, unfortunately - there's a nontrivial
increase in latency (and also some CPU) due to it. That matters mostly at
shallow look-ahead depths (like at the start of a stream), where the latency
impact directly influences query performance.

Setup:

CREATE EXTENSION IF NOT EXISTS test_aio;
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
DROP TABLE IF EXISTS pattern_random_pgbench;
CREATE TABLE pattern_random_pgbench AS SELECT ARRAY(SELECT random(0, pg_relation_size('pgbench_accounts')/8192 - 1)::int4 FROM generate_series(1, 500)) AS pattern;

workload:

SET io_combine_limit = 1;

SET effective_io_concurrency=1;
SELECT pg_buffercache_evict_relation('pgbench_accounts');
SELECT read_stream_for_blocks('pgbench_accounts', pattern) FROM pattern_random_pgbench LIMIT 1;

(and then repeated for eic 2, 4, 8, ..., 128)

eic    plain (ms)    forced async (ms)
  1         2.331                5.366
  2         2.164                3.210
  4         2.151                2.677
  8         2.155                2.749
 16         2.151                2.742
 32         2.141                2.732
 64         2.161                2.739
128         2.153                2.652

Note that forced async never quite catches up.

If I instead make the pattern 50k blocks long:

eic    plain (ms)    forced async (ms)
  1       210.678              454.132
  2       209.210              281.452
  4       208.775              198.496
  8       208.755              198.131
 16       209.477              195.799
 32       203.497              183.297
 64       203.002              173.799
128       202.885              166.548

> The case where it is not (as you've stated before) is when the
> consumer doesn't need the extra blocks, so it is just wasted time
> spent acquiring them.

That's one reason, but as shown above, it's also that the increase in latency
can hurt, particularly in the first few blocks (where we are ramping up the
distance) and when effective_io_concurrency is too low to allow for a deep
enough read-ahead to hide the latency increase.

> So, it feels odd to try and find a heuristic that allows the readahead
> distance to increase even when the consumer is not having to wait.

Do you still feel like that with the added context from the above?

> I'm not saying we should do this for this release, but I'm just wondering if
> in the medium term, we should try to find a better way to identify the
> situation where async processing is not beneficial because the blocks won't
> be needed.

I think we certainly can do better than today with some help, e.g. from the
planner, to identify cases where we should be more careful about reading ahead
too far, e.g. due to being on the inner side of a nestloop antijoin.

> > So 0005 expands the io_uring heuristic to also trigger based on the sizes
> > of IOs - but that's decidedly not perfect, we e.g. have some experiments
> > showing it regressing some parallel bitmap heap scan cases. It may be
> > better to somehow tweak the logic to only trigger for worker.
> >
> > As is this has another issue, which is that it prevents IO combining in
> > situations where it shouldn't, because right now the distance controls
> > both. See 0008 for an attempt at splitting those concerns.
>
> Yea, I think running ahead far enough to get bigger IOs needs to
> happen and can't be based on the consumer having to wait.

What do you think about the updated patch I posted to achieve that?

Greetings,

Andres Freund
