Re: index prefetching

From: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>
To: Tomas Vondra <tomas(at)vondra(dot)me>, Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-12-29 16:34:47
Message-ID: f0631dfa-0158-472b-8d6f-95728a5865a9@garret.ru
Lists: pgsql-hackers


On 29/12/2025 1:53 AM, Tomas Vondra wrote:
>
> On 12/28/25 21:30, Konstantin Knizhnik wrote:
>> On 28/12/2025 8:08 PM, Tomas Vondra wrote:
>>> On 12/25/25 16:39, Konstantin Knizhnik wrote:
>>>> On 21/12/2025 7:55 PM, Peter Geoghegan wrote:
>>>>> On Wed, Dec 10, 2025 at 9:21 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>>>>>> Attached is v4.
>>>>> Attached is v5. Changes from v4:
>>>>>
>>>>> * Simplified and optimized index-only scans, with a particular
>>>>> emphasis on avoiding regressions with nested loop joins with an inner
>>>>> index-only scan.
>>>>>
>>>>> There were quite a number of small problems/dead code related to
>>>>> index-only scans fixed by this new v5. Overall, I'm quite a bit
>>>>> happier with the state of index-only scans, which I'd not paid too
>>>>> much attention to before now.
>>>>>
>>>>> * Added Valgrind instrumentation to the hash index patch, which was
>>>>> required to fix some false positives.
>>>>>
>>>>> The generic indexam_util_batch_unlock routine had Valgrind
>>>>> instrumentation in earlier versions, just to keep nbtree's buffer
>>>>> locking checks from generating similar false positives. Some time
>>>>> later, when I added the hashgetbatch patch, there were new Valgrind
>>>>> false positives during hash index scans -- which I missed at first.
>>>>> This new v5 revision adds similar Valgrind checks to hash itself
>>>>> (changes that add code that is more or less a direct port of the stuff
>>>>> added to nbtree by commit 4a70f829), which fixes the false positives,
>>>>> and is independently useful.
>>>>>
>>>>> The rule for amgetbatch-based index AMs is that they must have similar
>>>>> buffer locking instrumentation. That seems like a good thing.
>>>>>
>>>>> --
>>>>> Peter Geoghegan
>>>> In the previous mail I shared results of my experiments with different
>>>> prefetch distances.
>>>> I think that we should start prefetching heap tuples not from the
>>>> second batch, but after some number of processed TIDs.
>>>>
>>>> Attached please find a patch which implements this approach.
>>>> And below are updated results:
>>>>
>>>> limit\prefetch    on      off     always  inc     threshold
>>>> 1                 12074   12765   3146    3282    12394
>>>> 2                 5912    6198    2463    2438    6124
>>>> 4                 2919    3047    1334    1964    2910
>>>> 8                 1554    1496    1166    1409    1588
>>>> 16                815     775     947     940     600
>>>> 32                424     403     687     695     478
>>>> 64                223     208     446     453     358
>>>> 128               115     106     258     270     232
>>>> 256               68      53      138     149     131
>>>> 512               43      27      72      78      71
>>>> 1024              28      13      38      40      38
>>>>
>>>> The last column is the result of prefetching with read_stream_threshold=10.
>>>>
>>> That's great, but it only works for cases that can (and do) benefit from
>>> the prefetching. Try running the benchmark with a data set that fits
>>> into shared buffers (or RAM), which makes prefetching useless.
>>>
>>> I tried that with your test, comparing master, v5 and v5 + your
>>> read_stream_threshold patch. See the attached run.sh script, and the PDF
>>> summarizing the results. The last two column groups are comparisons to
>>> master, with green=improvement, red=regression. There are no actual
>>> improvements (1% delta is just noise). But the read_stream_threshold
>>> results have a clear pattern of pretty massive (20-30%) regressions.
>>>
>>> The difference between v5 and v5-threshold is pretty clear.
>>>
>>> IIRC cases like this are *exactly* why we ended up with the current
>>> heuristics, enabling prefetching only from the second batch. This
>>> removes the risk of expensive read_stream init for very fast queries
>>> that don't benefit anything. Of course, prefetching may be useless for
>>> later batches too (e.g. if all the data is cached), but the query will
>>> be expensive enough for the read_stream init cost to be negligible.
>>>
>>> To put this differently, the more aggressive the heuristic is (enabling
>>> prefetching in more cases), the more likely it is to cause regressions.
>>> We've chosen to be more defensive, i.e. to sacrifice some possible gains
>>> in order to not regress plausible workloads. I hope we agree queries on
>>> fully cached "hot" data are pretty common / important.
>>>
>>> We can probably do better in the future. But we'll never know for sure
>>> if a given scan benefits from prefetching. It's not just about the
>>> number of items in the batch, but also about how many heap pages that
>>> translates to, what I/O pattern (random vs. sequential?), how many are
>>> already cached. For some queries we don't even know how many items we'll
>>> actually need. We can't check all that at the very beginning, because
>>> it's simply prohibitively expensive.
>>
>> I tried to reproduce your results, but on Mac I do not see any
>> noticeable difference for 250k records, fillfactor=10 and 4GB shared
>> buffers between `enable_indexscan_prefetch=false` and
>> `enable_indexscan_prefetch=true`.
>> I can't believe that just adding these checks in `heap_batch_advance_pos`
>> can cause a 75% performance degradation (because for limit < 10, no read
>> stream is initialized, but we still lose 25% somewhere).
>>
>> I just commented this fragment of code in heapam_handler.c:
>>
>>
>> #if 0
>>     proceed_items = ScanDirectionIsForward(direction)
>>         ? pos->item - batch->firstItem
>>         : batch->lastItem - pos->item;
>>     /* Delay initializing stream until proceeding */
>>     if (proceed_items >= read_stream_threshold
>>         && !scan->xs_heapfetch->rs
>>         && !scan->batchqueue->disabled
>>         && !scan->xs_want_itup  /* XXX prefetching disabled for IoS, for now */
>>         && enable_indexscan_prefetch)
>>     {
>>         scan->xs_heapfetch->rs =
>>             read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
>>                                        scan->heapRelation, MAIN_FORKNUM,
>>                                        scan->heapRelation->rd_tableam->index_getnext_stream,
>>                                        scan, 0);
>>     }
>> #endif
>>
>> and ... see no difference.
>>
>> I can understand why initializing the read stream earlier (not at the
>> second batch, but after 10 processed items) may have a negative impact on
>> performance when all data is present in shared buffers for LIMIT>=10.
>> But how can it happen with LIMIT 1 and the fragment above commented out?
>> There is nothing else in my patch except adding the GUC.
>> So I think it is some "external" factor, and I wonder if you can
>> reproduce these results (just the first line).
>>
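In essence, the item-count heuristic in the fragment above boils down to something like the following standalone sketch (all names here are invented for illustration; the real patch works on the batch position and calls read_stream_begin_relation()):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Minimal sketch of the item-count threshold: delay starting the read
 * stream until the scan has actually processed "threshold" items, so
 * that very short scans (e.g. LIMIT 1 on cached data) never pay the
 * stream-initialization cost.
 */
typedef struct
{
	bool	stream_started;		/* read stream initialized? */
	int		items_processed;	/* TIDs consumed so far */
	int		threshold;			/* cf. the read_stream_threshold GUC */
} PrefetchDecision;

/* Called per processed TID; returns true when the stream should start. */
static bool
advance_and_check(PrefetchDecision *d)
{
	d->items_processed++;
	if (!d->stream_started && d->items_processed >= d->threshold)
	{
		d->stream_started = true;	/* read_stream_begin_relation() here */
		return true;
	}
	return false;
}
```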
> It seems this is due to sending an extra SET (for the new GUC) in the
> pgbench script, which is recognized only on the v5+threshold build.
>
> That's a thinko on my side, I should have realized the extra command
> might affect this. It doesn't really affect the behavior, because 10 is
> the default value for read_stream_threshold. I've fixed the script, will
> check fresh results tomorrow.
>
> Still, I think most of what I said about heuristics when to initialize
> the read stream, and the risk/benefit tradeoff, still applies.
>
>
> regards

Attached please find an alternative version of the proposed patch, which
uses the number of disk reads as the criterion for using the read stream.
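To illustrate the idea (this is only a sketch with invented names and accounting, not the patch's actual code): instead of counting processed items, count buffer misses, so that a fully cached scan never starts a read stream at all:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the disk-read criterion: the read stream is initialized
 * only after the scan has incurred enough actual disk reads (buffer
 * misses). A scan running entirely from shared buffers therefore
 * never starts a stream.
 */
typedef struct
{
	bool	stream_started;
	int		disk_reads;		/* buffer misses seen so far */
	int		read_threshold;	/* misses needed before starting the stream */
} DiskReadDecision;

/*
 * Called once per heap fetch; was_cache_miss says whether the buffer
 * had to be read from disk. Returns true when the stream should be
 * initialized (read_stream_begin_relation() in the real patch).
 */
static bool
note_heap_fetch(DiskReadDecision *d, bool was_cache_miss)
{
	if (was_cache_miss)
		d->disk_reads++;
	if (!d->stream_started && d->disk_reads >= d->read_threshold)
	{
		d->stream_started = true;
		return true;
	}
	return false;
}
```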

Attachment Content-Type Size
read_stream_threshold_v2.patch text/plain 5.8 KB
