Re: index prefetching

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Alexandre Felipe <o(dot)alexandre(dot)felipe(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2026-02-15 22:39:21
Message-ID: 7herwtpae3ptqdng3s7tcft4ljkc23fyocp3mbrvc7xyk7s2lk@uq3qbm4blizo
Lists: pgsql-hackers

Hi,

On 2026-02-15 22:17:05 +0100, Tomas Vondra wrote:
> I don't have access to an M1 machine (and it also does not say what type
> of storage it is using, which seems pretty important for a patch aiming
> to improve I/O behavior). But I tried running this on my Ryzen machine
> with local SSDs (in RAID0), and with the 100k rows (and fixed handling
> of page cache) I get this:
>
> column_name io_method evict n master_ms off_ms on_ms effect_pct
> periodic worker off 10 35.8 35.1 36.5 2.0
> periodic worker os 10 49.4 49.9 58.8 8.1
> periodic worker pg 10 39.5 39.9 47.1 8.3
> random worker off 10 35.9 35.6 35.7 0.2
> random worker os 10 49.0 49.0 42.6 -7.0
> random worker pg 10 39.6 39.9 40.9 1.2
> sequential worker off 10 28.2 27.9 27.7 -0.4
> sequential worker os 10 39.3 39.2 34.8 -6.0
> sequential worker pg 10 30.1 30.1 29.4 -1.3
>
> column_name io_method evict n master_ms off_ms on_ms effect_pct
> periodic io_uring off 10 35.9 35.8 35.8 -0.1
> periodic io_uring os 10 49.3 49.9 50.0 0.1
> periodic io_uring pg 10 40.1 39.8 41.7 2.4
> random io_uring off 10 35.6 35.2 35.7 0.8
> random io_uring os 10 49.1 48.9 46.1 -3.0
> random io_uring pg 10 39.8 40.1 42.6 3.1
> sequential io_uring off 10 28.0 27.8 28.0 0.4
> sequential io_uring os 10 39.8 39.1 40.7 1.9
> sequential io_uring pg 10 30.2 30.0 29.6 -0.8
>
> This is on default config with io_workers=12 and data_checksums=off. I'm
> not showing results for parallel query, because it's irrelevant.
>
> This also has timings for master, for worker and io_uring (which you
> could not get on M1, at least not on macOS). For "worker" the differences
> are much smaller (within 10% in the worst case), and almost non-existent
> for io_uring. Which suggests this is likely due to the "signal" overhead
> associated with worker, which can be annoying for certain data patterns
> (where we end up issuing an I/O for individual blocks at distance 1).

I don't think this is just the signalling issue. For "periodic" I think it's
the signalling latency amplified by the read stream distance being kept too
low. With such a small distance, the latency affects us much more.

On my system, with turbo boost etc. disabled:

worker w/ enable_indexscan_prefetch=0:

Index Scan using idx_periodic_100000 on prefetch_test_data_100000 (cost=0.29..15101.09 rows=100000 width=208) (actual time=0.157..84.129 rows=100000.00 loops=1)
Index Searches: 1
Buffers: shared hit=97150 read=3125
I/O Timings: shared read=31.274
Planning:
Buffers: shared hit=97 read=7
I/O Timings: shared read=0.595
Planning Time: 0.944 ms
Execution Time: 89.319 ms

worker w/ enable_indexscan_prefetch=1:

Index Scan using idx_periodic_100000 on prefetch_test_data_100000 (cost=0.29..15101.09 rows=100000 width=208) (actual time=0.158..115.279 rows=100000.00 loops=1)
Index Searches: 1
Prefetch: distance=1.060 count=99635 stalls=3004 skipped=0 resets=0 pauses=0 ungets=0 forwarded=0
histogram [1,2) => 93627, [2,4) => 6008
Buffers: shared hit=97150 read=3125
I/O Timings: shared read=56.077
Planning:
Buffers: shared hit=97 read=7
I/O Timings: shared read=0.612
Planning Time: 0.994 ms
Execution Time: 120.575 ms

Right, a regression. But note how low the distance is - no wonder the worker
latency has a bad effect - we only have the downside, never the upside, as
there's pretty much no IO concurrency.

After applying this diff:

@@ -1006,7 +1038,9 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		stream->oldest_io_index = 0;
 
 		/* Look-ahead distance ramps up rapidly after we do I/O. */
-		distance = stream->distance * 2;
+		distance = stream->distance * 2
+			+ 1
+			;
 		distance = Min(distance, stream->max_pinned_buffers);
 		stream->distance = distance;

worker w/ enable_indexscan_prefetch=1 + patch:

Index Scan using idx_periodic_100000 on prefetch_test_data_100000 (cost=0.29..15101.09 rows=100000 width=208) (actual time=0.157..82.673 rows=100000.00 loops=1)
Index Searches: 1
Prefetch: distance=70.892 count=103109 stalls=5 skipped=0 resets=0 pauses=0 ungets=3474 forwarded=0
histogram [1,2) => 88975, [2,4) => 5, [4,8) => 11, [8,16) => 26, [16,32) => 28, [32,64) => 64, [64,128) => 104, [128,256) => 136, [256,512) => 602, [512,1024) => 13158
Buffers: shared hit=97150 read=3125
I/O Timings: shared read=19.711
Planning:
Buffers: shared hit=97 read=7
I/O Timings: shared read=0.596
Planning Time: 0.951 ms
Execution Time: 87.887 ms

By no means a huge win compared to prefetching being disabled, but the
regression does vanish.

The problem this fixes is that the periodic workload frequently has cache
hits, each of which reduces stream->distance by 1. Then, on a miss, we
double the distance. But at distance 1, doubling also only adds 1, so with
the trivial pattern of alternating one hit and one miss, which this
workload very often has, you *never* get above 1 - we increase the
distance exactly as quickly as we decrease it.

Greetings,

Andres Freund
