Re: index prefetching

From: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>
To: Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-12-25 08:56:50
Message-ID: 383515d3-1122-47fd-b6ea-09152500496a@garret.ru
Lists: pgsql-hackers


On 18/12/2025 4:45 PM, Andres Freund wrote:
> Hi,
>
> On 2025-12-18 15:40:59 +0100, Tomas Vondra wrote:
>> The technical reason is that batch_getnext() does this:
>>
>> /* Delay initializing stream until reading from scan's second batch */
>> if (priorbatch && !scan->xs_heapfetch->rs && !batchqueue->disabled &&
>> enable_indexscan_prefetch)
>> scan->xs_heapfetch->rs =
>> read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
>> ....);
>>
>> which means we only create the read_stream (which is what enables the
>> prefetching) only when creating the second batch. And with LIMIT 100 we
>> likely read just a single leaf page (=batch) most of the time, which
>> means no read_stream and thus no prefetching.
> Why is the logic tied to the number of batches, rather the number of items in
> batches? It's not hard to come up with scenarios where having to wait for ~100
> random pages will be the majority of the queries IO wait... It makes sense to
> not initialize readahead if we just fetch an entry or two, but after that?

I did more experiments trying to understand when we can take advantage
of prefetching.

The schema is the same:

create table t (pk integer, sk integer, payload text default repeat('x', 1000)) with (fillfactor=10);
insert into t values (generate_series(1,10000000),random()*10000000);
create index on t(sk);

select.sql:

\set sk random(1, 10000000)
select * from t where sk >= :sk order by sk limit N;

And I run:
pgbench -n -T 30 -M prepared -f select.sql postgres

limit\prefetch      on      off  always  incremental     (pgbench tps)
1                12074    12765    3146         3282
2                 5912     6198    2463         2438
4                 2919     3047    1334         1964
8                 1554     1496    1166         1409
16                 815      775     947          940
32                 424      403     687          695
64                 223      208     446          453
128                115      106     258          270
256                 68       53     138          149
512                 43       27      72           78
1024                28       13      38           40

prefetch=always means commenting out the `priorbatch` check, so the
read_stream is created immediately:

        /* Delay initializing stream until reading from scan's second batch */
-       if (priorbatch && !scan->xs_heapfetch->rs && !batchqueue->disabled &&
+       if (/* priorbatch && */ !scan->xs_heapfetch->rs && !batchqueue->disabled &&

prefetch=incremental replaces the doubling of the prefetch distance with a simple increment:

        /* Look-ahead distance ramps up rapidly after we do I/O. */
-        distance = stream->distance * 2;
+       distance = stream->distance ? stream->distance + 1 : 0;

So as you expected, immediate creation of the read_stream causes quite
significant performance degradation on index scans inspecting a small
number of TIDs.
The threshold where the read stream starts to pay off appears to be
around 10 fetched items.
Beyond it, earlier initialization of the read stream gives a quite
noticeable performance improvement.

Using a profiler and debugger, I tried to find out where most of the
time is spent in this case, and the answer was quite predictable:
in read_stream_reset->read_stream_next_buffer.

So we are just consuming prefetched buffers which we do not need.

I thought that we could use some better policy for increasing the
prefetch distance (right now it is just doubled).
This is why I tried the "incremental" policy.
Unfortunately it cannot help to reduce the prefetch overhead for "short"
index scans.
But what surprised me is that for longer index scans this approach seems
to be slightly more efficient than doubling.

So it looks like we really should use a number-of-items criterion for
read stream initialization rather than the number of batches.
And maybe also think about an alternative policy for increasing the
prefetch distance.
