Re: index prefetching

From: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>
To: Tomas Vondra <tomas(at)vondra(dot)me>, Andres Freund <andres(at)anarazel(dot)de>, Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-12-18 16:35:55
Message-ID: 03cbf145-ea85-49d3-b5fc-71ed144c28e2@garret.ru
Lists: pgsql-hackers


On 18/12/2025 4:40 PM, Tomas Vondra wrote:
>
> On 12/18/25 14:57, Konstantin Knizhnik wrote:
>> On 17/12/2025 9:54 PM, Tomas Vondra wrote:
>>> On 12/17/25 20:30, Andres Freund wrote:
>>>> Hi,
>>>>
>>>> On 2025-12-17 13:49:43 -0500, Peter Geoghegan wrote:
>>>>> On Wed, Dec 17, 2025 at 12:19 PM Konstantin Knizhnik
>>>>> <knizhnik(at)garret(dot)ru> wrote:
>>>>>> Moreover with `enable_indexscan_prefetch=off` results are the same.
>>>>> It's quite unlikely that the current heuristics that trigger
>>>>> prefetching would have ever allowed any prefetching, for queries such
>>>>> as these.
>>>>>
>>>>> The exact rule right now is that we don't even begin prefetching until
>>>>> we've already read at least one index leaf page, and have to read
>>>>> another one. So it's impossible to use prefetching with a LIMIT of 1,
>>>>> with queries such as these. It's highly unlikely that you'd see any
>>>>> benefits from prefetching even with LIMIT 100 (usually we wouldn't
>>>>> even begin prefetching).
>>>> Note that due to the tuple size and fillfactor in Konstantin's
>>>> workload, there
>>>> will be one tuple per page... That should allow for some prefetching.
>>>>
>>> Yes, but that's in the heap. The mechanism Peter described is about leaf
>>> pages in the index, and the index has the usual fillfactor. So there'll
>>> be many index entries per leaf.
>>>
>> I slightly changed my benchmark setup:
>>
>> create table t (pk integer, sk integer, payload text default repeat('x',
>> 1000)) with (fillfactor=10);
>> insert into t values (generate_series(1,10000000),random()*10000000);
>> create index on t(sk);
>>
>> select.sql:
>>
>> \set sk random(1, 10000000)
>> select * from t where sk >= :sk order by sk limit N;
>>
>> You are right. There is almost no effect from prefetch for limit=100,
>> but a ~2x improvement for limit=1000:
>>
>> eio\limit       1     100   1000
>> 10          11102     142     28
>> 0           11419     137     14
>>
>> master:
>> limit           1     100   1000
>>             11480     130     13
>>
>> One of the motivations for my experiments was to check that there is
>> no performance degradation because of batching.
>> And it is nice that there is no performance penalty here.
>> Still, it is not quite clear to me why there is no positive effect
>> for LIMIT 100.
> The technical reason is that batch_getnext() does this:
>
> /* Delay initializing stream until reading from scan's second batch */
> if (priorbatch && !scan->xs_heapfetch->rs && !batchqueue->disabled &&
> enable_indexscan_prefetch)
> scan->xs_heapfetch->rs =
> read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
> ....);
>
> which means we only create the read_stream (which is what enables the
> prefetching) only when creating the second batch. And with LIMIT 100 we
> likely read just a single leaf page (=batch) most of the time, which
> means no read_stream and thus no prefetching.
>
> You can try disabling this "priorbatch" condition, so that the
> read_stream gets created right away.

That has the expected effect: performance for LIMIT 100 increases from
142 TPS to 315 TPS (so also ~2x). At the same time, performance for
LIMIT 1 drops from 11419 to 3499 TPS, about 4x. For LIMIT 10, the
results are 1388 TPS with prefetch disabled and 1116 TPS with it
enabled. So it looks like the threshold for enabling prefetch should be
based not on the number of batches, but on the expected number of heap
reads, and the break-even point is closer to 100 (or even 10) than to
1000. And for slower disks (or remote storage), the effect of prefetch
should be much bigger.

> True, but only if the data is not already in memory / shared buffers.
> IIRC this "priorbatch" logic mitigates regressions for cached workloads,
> because the read_stream initialization is expensive enough to hurt small
> queries when no I/O is needed.

I see.
But maybe we should compare the table size with shared_buffers or
effective_cache_size?
