Re: index prefetching

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-08-19 17:23:24
Message-ID: 9af33041-1c16-4973-855a-718aa1048ee1@vondra.me
Lists: pgsql-hackers

On 8/17/25 19:30, Peter Geoghegan wrote:
> On Thu, Aug 14, 2025 at 10:12 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> As far as I know, we only have the following unambiguous performance
>> regressions (that clearly need to be fixed):
>>
>> 1. This issue.
>>
>> 2. There's about a 3% loss of throughput on pgbench SELECT.
>
> Update: I managed to fix the performance regression with pgbench
> SELECT (regression 2). Since Andres' patch fixes the other regression
> (regression 1), we no longer have any known performance regression
> (though I don't doubt that they still exist somewhere). I've also
> added back the enable_indexscan_prefetch testing GUC (Andres asked me
> to do that a few weeks back). If you set
> enable_indexscan_prefetch=false, btgetbatch performance is virtually
> identical to master/btgettuple.
>
> A working copy of the patchset with these revisions is available from:
> https://github.com/petergeoghegan/postgres/tree/index-prefetch-batch-v1.6
>
> The solution to the pgbench issue was surprisingly straightforward.
> Profiling showed that the regression was caused by the added overhead
> of using the read stream, for queries where prefetching cannot
> possibly help -- such small startup costs are relatively noticeable
> with pgbench's highly selective scans. It turns out that it's possible
> to initially avoid using a read stream, while still retaining the
> option of switching over to using a read stream later on. The trick to
> fixing the pgbench issue was delaying creating a read stream for long
> enough for the pgbench queries to never need to create one, without
> that impacting queries that at least have some chance of benefiting
> from prefetching.
>
> The actual heuristic I'm using to decide when to start the read stream
> is simple: only start a read stream right after the scan's second
> batch is returned by amgetbatch, but before we've fetched any heap
> blocks related to that second batch (start using a read stream when
> fetching new heap blocks from that second batch). It's possible that
> that heuristic isn't sophisticated enough for other types of queries.
> But either way the basic structure within indexam.c places no
> restrictions on when we start a read stream. It doesn't have to be
> aligned with amgetbatch-wise batch boundaries, for example (I just
> found that structure convenient).
>
> I haven't spent much time testing this change, but it appears to work
> perfectly (no pgbench regressions, but also no regressions in queries
> that were already seeing significant benefits from prefetching). I'd
> feel better about all this if we had better testing of the read stream
> invariants by (say) adding assertions to index_scan_stream_read_next,
> the read stream callback. And just having comments that explain those
> invariants.
>

Thanks for investigating this. I think it's the right direction - simple
OLTP queries should not be paying for building read_stream when there's
little chance of benefit.

Unfortunately, this seems to be causing regressions, both compared to
master (or disabled prefetching), and to the earlier prefetch patches.

I kept running the query generator [1] that builds data sets with
randomized parameters, and then runs index scan queries on that, looking
for differences between branches.

Consider this data set:

------------------------------------------------------------------------
create unlogged table t (a bigint, b text) with (fillfactor = 20);

insert into t
select 1 * a, b from (
  select r, a, b, generate_series(0, 1-1) AS p from (
    select row_number() over () AS r, a, b from (
      select i AS a, md5(i::text) AS b
        from generate_series(1, 10000000) s(i)
       ORDER BY (i + 256 * (random() - 0.5))) foo) bar) baz
ORDER BY ((r * 1 + p) + 128 * (random() - 0.5));

create index idx on t(a ASC);

vacuum freeze t;

analyze t;
------------------------------------------------------------------------

Let's run this query (all runs are with cold caches):

EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM t WHERE a BETWEEN 5085 AND 3053660 ORDER BY a ASC;

1) current patch
================

QUERY PLAN
-----------------------------------------------------------------------
Index Scan using idx on t (actual time=0.517..6593.821 rows=3048576.00 loops=1)
Index Cond: ((a >= 5085) AND (a <= 3053660))
Index Searches: 1
Prefetch Distance: 2.066
Prefetch Count: 296179
Prefetch Stalls: 2553745
Prefetch Skips: 198613
Prefetch Resets: 0
Stream Ungets: 0
Stream Forwarded: 74
Prefetch Histogram: [2,4) => 289560, [4,8) => 6604, [8,16) => 15
Buffers: shared hit=2704779 read=153516
Planning:
Buffers: shared hit=78 read=27
Planning Time: 5.525 ms
Execution Time: 6721.599 ms
(16 rows)

2) removed priorbatch (always uses read stream)
===============================================

QUERY PLAN
-----------------------------------------------------------------------
Index Scan using idx on t (actual time=1.008..1932.379 rows=3048576.00 loops=1)
Index Cond: ((a >= 5085) AND (a <= 3053660))
Index Searches: 1
Prefetch Distance: 87.970
Prefetch Count: 2877141
Prefetch Stalls: 1
Prefetch Skips: 198617
Prefetch Resets: 0
Stream Ungets: 27182
Stream Forwarded: 7640
Prefetch Histogram: [2,4) => 2, [4,8) => 6, [8,16) => 7, [16,32) => 10, [32,64) => 8183, [64,128) => 2868933
Buffers: shared hit=2704571 read=153516
Planning:
Buffers: shared hit=78 read=27
Planning Time: 14.302 ms
Execution Time: 2036.654 ms
(16 rows)

3) no prefetch (same as master)
===============================

set enable_indexscan_prefetch = off;

QUERY PLAN
-----------------------------------------------------------------------
Index Scan using idx on t (actual time=0.850..1336.723 rows=3048576.00 loops=1)
Index Cond: ((a >= 5085) AND (a <= 3053660))
Index Searches: 1
Buffers: shared hit=2704779 read=153516
Planning:
Buffers: shared hit=82 read=22
Planning Time: 10.696 ms
Execution Time: 1433.530 ms
(8 rows)

The main difference in the explains is this:

Prefetch Distance: 2.066 (new patch)

Prefetch Distance: 87.970 (old patch, without priorbatch)

The histograms confirm this, with most prefetches falling in the [2,4)
bin for the new patch and in the [64,128) bin for the old one. The new
patch ends up with a much lower prefetch distance.

I believe this is the same issue as the "collapsed" distance after
resetting the read_stream. In that case the trouble was that the reset
also set the distance back to 1, and there were so many "hits" (due to
buffers read earlier) that it never ramped up again - the distance got
doubled every now and then, but the decay was faster.

The same thing happens here, I think. We process the first batch without
using a read stream. Then after reading the second batch we create the
read_stream, but it starts with distance=1 - just like after a reset.
And the distance never ramps up, because of the hits on buffers we
already read while processing the preceding batch.
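Just to illustrate what I mean, here's a toy model of the dynamics (in
Python, with made-up constants - the real logic lives in read_stream.c
and is more subtle):

```python
# Toy model of the distance heuristic (NOT the actual read_stream.c
# logic): assume the distance decays by 1 on every cache hit and
# doubles on every miss, capped at some maximum.
MAX_DISTANCE = 128

def simulate(accesses, start_distance=1):
    """accesses: sequence of booleans, True = cache hit, False = miss."""
    distance = start_distance
    for hit in accesses:
        if hit:
            distance = max(1, distance - 1)             # decay on hit
        else:
            distance = min(MAX_DISTANCE, distance * 2)  # ramp up on miss
    return distance

# ~90% of reads hit cache (buffers already touched while processing
# the first batch without a stream):
pattern = ([True] * 9 + [False]) * 1000
print(simulate(pattern, start_distance=1))    # -> 2   (stays collapsed)
print(simulate(pattern, start_distance=87))   # -> 128 (stays ramped up)
```

The point being that with a high hit ratio the behavior is bistable - a
distance that starts at 1 never escapes, while a distance restored to a
high value stays high. That seems consistent with why remembering the
distance across resets helped.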

For the resets, the solution (at least for now) was to remember the
distance and restore it after reset. But here we don't have any distance
to restore - there's no prefetch or read stream.

Maybe it'd be possible to track some stats during the initial phase, and
then use those to initialize the distance for the first batch processed
by the read stream? Seems rather inconvenient, though.
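For what it's worth, the stats-tracking idea might look something like
this (purely hypothetical sketch, all names made up - the real thing
would have to use whatever stats are cheap to collect):

```python
# Hypothetical sketch (made-up names): while the scan still runs
# without a read stream, count hits/misses on the heap blocks we
# fetch, then seed the stream's distance from the miss ratio.
MAX_DISTANCE = 128

def initial_distance(hits, misses):
    """Pick a starting distance from stats gathered before the stream exists."""
    total = hits + misses
    if total == 0 or misses == 0:
        return 1          # everything cached so far, start conservative
    miss_ratio = misses / total
    # The more misses we saw, the more aggressively we want to prefetch.
    return max(1, min(MAX_DISTANCE, round(miss_ratio * MAX_DISTANCE)))

print(initial_distance(hits=90, misses=10))   # -> 13
print(initial_distance(hits=0, misses=100))   # -> 128
```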

What exactly is the overhead of creating the read_stream? Is it about
allocating memory, or something else? Would it be possible to reduce the
overhead enough for it not to matter even for OLTP queries? Maybe the
read_stream could be initialized only "partially", enough to do sync I/O
and track the distance, delaying only the expensive stuff?
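To sketch what I mean by "partial" initialization (again just a toy
illustration, not a claim about how read_stream.c is or should be
structured):

```python
# Toy illustration only. The wrapper starts in a cheap synchronous
# mode, keeps updating the would-be distance, and allocates the
# expensive state only once the distance suggests prefetching would
# actually help.
class LazyStream:
    def __init__(self, reader):
        self.reader = reader      # callable: block -> (data, was_cache_hit)
        self.distance = 1
        self.full_stream = None   # expensive state, allocated lazily

    def read(self, block):
        data, hit = self.reader(block)
        # Track the distance exactly as a real stream would, so we
        # start from a warmed-up value when we do switch over.
        if hit:
            self.distance = max(1, self.distance - 1)
        else:
            self.distance = min(128, self.distance * 2)
        if self.full_stream is None and self.distance > 4:
            self.full_stream = object()   # stand-in for the real setup
        return data
```

That way an OLTP query that never misses cache never pays for the full
setup, while the distance is already warmed up by the time it's needed.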

I'm also not sure it's optimal to only initialize the read_stream after
reading the next batch. For some indexes a batch can have hundreds of
items, and that certainly could benefit from prefetching. I suppose it
should be possible to initialize the read_stream half-way through a
batch, right? Or is there a reason why that can't work?

regards

[1]
https://github.com/tvondra/postgres/tree/index-prefetch-master/query-stress-test

--
Tomas Vondra
