Re: index prefetching

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-07-23 16:55:02
Message-ID: cffa6e3e-f9c4-468f-b218-a3a5e172c226@vondra.me
Lists: pgsql-hackers

On 7/23/25 17:09, Andres Freund wrote:
> Hi,
>
> On 2025-07-23 14:50:15 +0200, Tomas Vondra wrote:
>> On 7/23/25 02:59, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2025-07-23 02:50:04 +0200, Tomas Vondra wrote:
>>>> But I don't see why would this have any effect on the prefetch distance,
>>>> queue depth etc. Or why decreasing INDEX_SCAN_MAX_BATCHES should improve
>>>> that. I'd have expected exactly the opposite behavior.
>>>>
>>>> Could be bug, of course. But it'd be helpful to see the dataset/query.
>>>
>>> Pgbench scale 500, with the simpler query from my message.
>>>
>>
>> I tried to reproduce this, but I'm not seeing this behavior. I'm not
>> sure how you monitor the queue depth (presumably iostat?)
>
> Yes, iostat, since I was looking at what the "actually required" lookahead
> distance is.
>
> Do you actually get the query to be entirely CPU bound? How much I/O
> waiting does EXPLAIN (ANALYZE, TIMING OFF) with track_io_timing=on
> report?
>

No, it definitely needs to wait for I/O (FWIW, this is on the Xeon box,
with a single NVMe SSD).

> Ah - I was using a very high effective_io_concurrency. With a high
> effective_io_concurrency value I see a lot of stalls, even at
> INDEX_SCAN_MAX_BATCHES = 64. And a lower prefetch distance, which seems
> somewhat odd.
>

I think that's a bug in the explain patch. The counters were updated at
the beginning of _next_buffer(), but that's wrong - a single call to
_next_buffer() can prefetch multiple blocks, and those prefetches were
not counted, skewing the stats towards "distance=0". With a higher eic
this happens sooner, so the average distance seemed to decrease.

The attached patch does the updates in _get_block() instead, which I
think is better. And "stall" now means (distance == 1), which should
detect blocks that are read without any prefetching.

I also added a separate "Count" for the actual number of prefetched
blocks, and "Skipped" for duplicate blocks that were skipped (the read
stream never even sees those, because they're skipped in the callback).
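
To make that concrete, here's a minimal standalone sketch of the
accounting. The struct and function names below are made up for
illustration (the actual patch hooks into the read stream machinery),
so treat it as a sketch of the idea rather than the real code:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-scan prefetch statistics. */
typedef struct PrefetchStats
{
    uint64_t    count;          /* blocks actually queued for prefetching */
    uint64_t    skipped;        /* duplicate blocks skipped in the callback */
    uint64_t    stalls;         /* blocks requested with distance == 1 */
    uint64_t    samples;        /* number of distance samples */
    double      distance_sum;   /* sum of distances, for the average */
} PrefetchStats;

/*
 * Called once per block from the _get_block() callback, so a single
 * _next_buffer() call that prefetches multiple blocks updates the
 * counters once for each of those blocks.
 */
static void
prefetch_account_block(PrefetchStats *stats, int distance, bool duplicate)
{
    if (duplicate)
    {
        /* the read stream never even sees this block */
        stats->skipped++;
        return;
    }

    stats->count++;
    stats->distance_sum += distance;
    stats->samples++;

    /* distance == 1 means the block is read with no prefetching */
    if (distance == 1)
        stats->stalls++;
}

In this sketch, the reported "Prefetch Distance" is simply
distance_sum / samples, "Prefetch Stalls" is the stalls counter, and
"Count"/"Skipped" map to count and skipped.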

>
> FWIW, in my tests I was just evicting lineitem from shared buffers, since I
> wanted to test the heap prefetching, without stalls induced by blocking on
> index reads. But what I described happens with either.
>
> ;SET effective_io_concurrency = 256;SELECT pg_buffercache_evict_relation('pgbench_accounts'); explain (analyze, costs off, timing off) SELECT max(abalance) FROM (SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 10000000);
>                                              QUERY PLAN
> ──────────────────────────────────────────────────────────────────────────────────────────────────────────
> Aggregate (actual rows=1.00 loops=1)
>   Buffers: shared hit=27369 read=164191
>   I/O Timings: shared read=358.795
>   ->  Limit (actual rows=10000000.00 loops=1)
>         Buffers: shared hit=27369 read=164191
>         I/O Timings: shared read=358.795
>         ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
>               Index Searches: 1
>               Prefetch Distance: 256.989
>               Prefetch Stalls: 3
>               Prefetch Resets: 3
>               Buffers: shared hit=27369 read=164191
>               I/O Timings: shared read=358.795
> Planning Time: 0.086 ms
> Execution Time: 4194.845 ms
>
> ;SET effective_io_concurrency = 512;SELECT pg_buffercache_evict_relation('pgbench_accounts'); explain (analyze, costs off, timing off) SELECT max(abalance) FROM (SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 10000000);
>                                              QUERY PLAN
> ──────────────────────────────────────────────────────────────────────────────────────────────────────────
> Aggregate (actual rows=1.00 loops=1)
>   Buffers: shared hit=27368 read=164190
>   I/O Timings: shared read=832.515
>   ->  Limit (actual rows=10000000.00 loops=1)
>         Buffers: shared hit=27368 read=164190
>         I/O Timings: shared read=832.515
>         ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1)
>               Index Searches: 1
>               Prefetch Distance: 56.778
>               Prefetch Stalls: 160569
>               Prefetch Resets: 423
>               Buffers: shared hit=27368 read=164190
>               I/O Timings: shared read=832.515
> Planning Time: 0.084 ms
> Execution Time: 4413.058 ms
>
> Greetings,
>

The attached v2 explain patch should fix that. I'm also attaching the
EXPLAIN logs from my runs, for 64 and 16 batches. I think the output
makes much more sense now.

cheers

--
Tomas Vondra

Attachment Content-Type Size
batches-64.log text/x-log 14.9 KB
batches-16.log text/x-log 17.7 KB
prefetch-distance-explain-v2.patch text/x-patch 8.9 KB
