From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-08-14 00:59:10
Message-ID: 7c308720-5cd4-4508-8cc0-6e55a6352731@vondra.me
Lists: pgsql-hackers

On 8/14/25 01:50, Peter Geoghegan wrote:
> On Wed Aug 13, 2025 at 5:19 PM EDT, Tomas Vondra wrote:
>> I did investigate this, and I don't think there's anything broken in
>> read_stream. It happens because ReadStream has a concept of "ungetting"
>> a block, which can happen after hitting some I/O limits.
>>
>> In that case we "remember" the last block (in read_stream_look_ahead
>> calls read_stream_unget_block), and we return it again. It may seem as
>> if read_stream_get_block() produced the same block twice, but it's
>> really just the block from the last round.
>
> I instrumented this for myself, and I agree: backwards and forwards scan cases
> are being fed the same block numbers, as expected (it's just that the order is
> precisely backwards, as expected). The only real difference is that the forwards
> scan case seems to be passed InvalidBlockNumber quite a bit more often. You were
> right: I was confused about the read_stream_unget_block thing.
>
> However, the magnitude of the difference that I see between the forwards and
> backwards scan cases just doesn't pass the smell test -- I stand by that part.
> I was able to confirm this intuition by performing a simple experiment.
>
> I asked myself a fairly obvious question: if the backwards scan in question
> takes about 2.5x as long, just because each group of TIDs for each index value
> appears in descending order, then what happens if the order is made random?
> Where does that leave the forwards scan case, and where does it leave the
> backwards scan case?
>
> I first made the order of the table random, except among groups of index tuples
> that have exactly the same value. Those will still point to the same 1 or 2 heap
> blocks in virtually all cases, so we have "heap clustering without any heap
> correlation" in the newly rewritten table. To set things up this way, I first
> made another index, and then clustered the table using that new index:
>
> pg(at)regression:5432 [2476413]=# create index on t (hashint8(a));
> CREATE INDEX
> pg(at)regression:5432 [2476413]=# cluster t using t_hashint8_idx ;
> CLUSTER
>
> Next, I reran the queries in the obvious way (same procedure as yesterday,
> though with a very different result):
>
> pg(at)regression:5432 [2476413]=# select pg_buffercache_evict_relation('t'); select pg_prewarm('idx');
> ***SNIP***
> pg(at)regression:5432 [2476413]=# EXPLAIN (ANALYZE ,costs off, timing off) SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a;
> ┌────────────────────────────────────────────────────────────┐
> │ QUERY PLAN │
> ├────────────────────────────────────────────────────────────┤
> │ Index Scan using idx on t (actual rows=1048576.00 loops=1) │
> │ Index Cond: ((a >= 16336) AND (a <= 49103)) │
> │ Index Searches: 1 │
> │ Buffers: shared hit=6082 read=77813 │
> │ I/O Timings: shared read=153.672 │
> │ Planning Time: 0.057 ms │
> │ Execution Time: 402.735 ms │
> └────────────────────────────────────────────────────────────┘
> (7 rows)
>
> pg(at)regression:5432 [2476413]=# select pg_buffercache_evict_relation('t'); select pg_prewarm('idx');
> ***SNIP***
> pg(at)regression:5432 [2476413]=# EXPLAIN (ANALYZE ,costs off, timing off) SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a desc;
> ┌─────────────────────────────────────────────────────────────────────┐
> │ QUERY PLAN │
> ├─────────────────────────────────────────────────────────────────────┤
> │ Index Scan Backward using idx on t (actual rows=1048576.00 loops=1) │
> │ Index Cond: ((a >= 16336) AND (a <= 49103)) │
> │ Index Searches: 1 │
> │ Buffers: shared hit=6082 read=77813 │
> │ I/O Timings: shared read=324.305 │
> │ Planning Time: 0.071 ms │
> │ Execution Time: 616.268 ms │
> └─────────────────────────────────────────────────────────────────────┘
> (7 rows)
>
> Apparently random I/O is twice as fast as sequential I/O in descending order! In
> fact, this test case creates the appearance of random I/O being at least
> slightly faster than sequential I/O for pages read in _ascending_ order!
>
> Obviously something doesn't add up here. I'm no closer to explaining what the
> underlying problem is than I was yesterday, but I find it _very_ hard to believe
> that the inconsistency in performance has anything to do with SSD firmware/OS
> implementation details. It just looks wonky to me.
>
> Also possibly worth noting: I'm pretty sure that "shared hit=6082" is wrong.
> Though now it's wrong in the same way with both variants.
>
> Just for context, I'll show what the TIDs for 3 randomly chosen
> adjacent-in-index values look like after CLUSTER runs (in case it was unclear
> what I meant about "heap clustering without any heap correlation" earlier):
>
Interesting. It's really surprising that random I/O beats sequential.

I investigated this from a different angle, by tracing the I/O requests
generated, using perf trace. And the patterns are massively different.

What I did is roughly this (the exact commands are sketched below the list):
1) restart the instance (with direct I/O)
2) perf trace record -m 128M -a -o $(date +%s).trace
3) run the query, and get the I/O worker PIDs with pgrep 'io worker'
4) stop the trace
5) extract pread64 events for the I/O workers from the trace
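
That is, roughly (the trace file name, the output file, and the worker PIDs
below are just placeholders, and the exact perf trace output format may
vary a bit between versions):

  # 2) record all syscalls system-wide; stop with Ctrl-C once the query is done
  perf trace record -m 128M -a -o $(date +%s).trace

  # 3) in another session, run the query and note the I/O worker PIDs
  #    (-f matches the full process title, not just the binary name)
  pgrep -f 'io worker'

  # 5) replay the recorded trace, keeping only pread64 calls from those PIDs
  perf trace -i 1755129550.trace 2>&1 \
    | grep -E '/(12345|12346|12347) ' | grep pread64 > q1-asc-pread.txt
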
I get these event counts:
Q1 ASC - 5395
Q1 DESC - 49969
Q2 ASC - 32804
Q2 DESC - 49958
It's interesting that the two DESC queries end up doing almost exactly the
same number of pread calls.
Anyway, small samples of the trace look like this:

Q1 ASC
pread64(fd: 7, buf: 0x7f6011b7f000, count: 81920, pos: 475193344)
pread64(fd: 24, buf: 0x7f6011b95000, count: 131072, pos: 475275264)
pread64(fd: 7, buf: 0x7f6011bb7000, count: 131072, pos: 475406336)
pread64(fd: 24, buf: 0x7f6011bd9000, count: 131072, pos: 475537408)
pread64(fd: 7, buf: 0x7f6011bfb000, count: 81920, pos: 475668480)
pread64(fd: 24, buf: 0x7f6011c0f000, count: 24576, pos: 475750400)
pread64(fd: 24, buf: 0x7f6011c15000, count: 24576, pos: 475774976)
pread64(fd: 24, buf: 0x7f6011c1d000, count: 131072, pos: 475799552)
pread64(fd: 7, buf: 0x7f6011c3f000, count: 106496, pos: 475930624)
pread64(fd: 24, buf: 0x7f6011c59000, count: 24576, pos: 476037120)
pread64(fd: 24, buf: 0x7f6011c61000, count: 131072, pos: 476061696)
pread64(fd: 7, buf: 0x7f6011c83000, count: 131072, pos: 476192768)
pread64(fd: 24, buf: 0x7f6011ca3000, count: 24576, pos: 476323840)
pread64(fd: 24, buf: 0x7f6011ca9000, count: 24576, pos: 476348416)
pread64(fd: 24, buf: 0x7f6011cb1000, count: 131072, pos: 476372992)
pread64(fd: 7, buf: 0x7f6011cd1000, count: 57344, pos: 476504064)

Q1 DESC
pread64(fd: 24, buf: 0x7fa8c1735000, count: 8192, pos: 230883328)
pread64(fd: 7, buf: 0x7fa8c1737000, count: 8192, pos: 230875136)
pread64(fd: 6, buf: 0x7fa8c173b000, count: 8192, pos: 230866944)
pread64(fd: 24, buf: 0x7fa8c173d000, count: 8192, pos: 230858752)
pread64(fd: 7, buf: 0x7fa8c173f000, count: 8192, pos: 230850560)
pread64(fd: 6, buf: 0x7fa8c1741000, count: 8192, pos: 230842368)
pread64(fd: 24, buf: 0x7fa8c1743000, count: 8192, pos: 230834176)
pread64(fd: 7, buf: 0x7fa8c1745000, count: 8192, pos: 230825984)
pread64(fd: 24, buf: 0x7fa8c1747000, count: 8192, pos: 230817792)
pread64(fd: 6, buf: 0x7fa8c1749000, count: 8192, pos: 230809600)
pread64(fd: 7, buf: 0x7fa8c174b000, count: 8192, pos: 230801408)
pread64(fd: 24, buf: 0x7fa8c174d000, count: 8192, pos: 230793216)
pread64(fd: 6, buf: 0x7fa8c174f000, count: 8192, pos: 230785024)
pread64(fd: 7, buf: 0x7fa8c1751000, count: 8192, pos: 230776832)
pread64(fd: 24, buf: 0x7fa8c1753000, count: 8192, pos: 230768640)
pread64(fd: 7, buf: 0x7fa8c1755000, count: 8192, pos: 230760448)
pread64(fd: 6, buf: 0x7fa8c1757000, count: 8192, pos: 230752256)

Q2 ASC
pread64(fd: 7, buf: 0x7fb8bbf27000, count: 8192, pos: 258695168)
pread64(fd: 24, buf: 0x7fb8bbf29000, count: 16384, pos: 258678784)
pread64(fd: 7, buf: 0x7fb8bbf2d000, count: 8192, pos: 258670592)
pread64(fd: 24, buf: 0x7fb8bbf2f000, count: 16384, pos: 258654208)
pread64(fd: 7, buf: 0x7fb8bbf33000, count: 8192, pos: 258646016)
pread64(fd: 24, buf: 0x7fb8bbf35000, count: 16384, pos: 258629632)
pread64(fd: 7, buf: 0x7fb8bbf39000, count: 8192, pos: 258621440)
pread64(fd: 24, buf: 0x7fb8bbf3d000, count: 16384, pos: 258605056)
pread64(fd: 7, buf: 0x7fb8bbf41000, count: 8192, pos: 258596864)
pread64(fd: 24, buf: 0x7fb8bbf43000, count: 16384, pos: 258580480)
pread64(fd: 7, buf: 0x7fb8bbf47000, count: 8192, pos: 258572288)
pread64(fd: 24, buf: 0x7fb8bbf49000, count: 16384, pos: 258555904)
pread64(fd: 7, buf: 0x7fb8bbf4d000, count: 8192, pos: 258547712)
pread64(fd: 24, buf: 0x7fb8bbf4f000, count: 16384, pos: 258531328)
pread64(fd: 7, buf: 0x7fb8bbf53000, count: 16384, pos: 258514944)
pread64(fd: 24, buf: 0x7fb8bbf57000, count: 8192, pos: 258506752)
pread64(fd: 7, buf: 0x7fb8bbf59000, count: 8192, pos: 258498560)
pread64(fd: 24, buf: 0x7fb8bbf5b000, count: 16384, pos: 258482176)

Q2 DESC
pread64(fd: 24, buf: 0x7fdcf0451000, count: 8192, pos: 598974464)
pread64(fd: 7, buf: 0x7fdcf0453000, count: 8192, pos: 598999040)
pread64(fd: 6, buf: 0x7fdcf0455000, count: 8192, pos: 598990848)
pread64(fd: 24, buf: 0x7fdcf0459000, count: 8192, pos: 599007232)
pread64(fd: 7, buf: 0x7fdcf045b000, count: 8192, pos: 599023616)
pread64(fd: 6, buf: 0x7fdcf045d000, count: 8192, pos: 599015424)
pread64(fd: 24, buf: 0x7fdcf045f000, count: 8192, pos: 599031808)
pread64(fd: 7, buf: 0x7fdcf0461000, count: 8192, pos: 599048192)
pread64(fd: 6, buf: 0x7fdcf0463000, count: 8192, pos: 599040000)
pread64(fd: 24, buf: 0x7fdcf0465000, count: 8192, pos: 599056384)
pread64(fd: 7, buf: 0x7fdcf0467000, count: 8192, pos: 599072768)
pread64(fd: 6, buf: 0x7fdcf0469000, count: 8192, pos: 599064576)
pread64(fd: 24, buf: 0x7fdcf046b000, count: 8192, pos: 599080960)
pread64(fd: 7, buf: 0x7fdcf046d000, count: 8192, pos: 599097344)
pread64(fd: 6, buf: 0x7fdcf046f000, count: 8192, pos: 599089152)
pread64(fd: 24, buf: 0x7fdcf0471000, count: 8192, pos: 599105536)
pread64(fd: 7, buf: 0x7fdcf0473000, count: 8192, pos: 599121920)
pread64(fd: 6, buf: 0x7fdcf0475000, count: 8192, pos: 599113728)

So, Q1 ASC gets to combine the I/O into nice large chunks, but the DESC
queries end up doing a stream of single 8KB requests. Q2 ASC gets to do
16KB reads in about half the cases, but the rest is still 8KB.
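
To quantify that, one can tally the pread64 request sizes in each extracted
trace, e.g. with something like (the file name is again a placeholder):

  # histogram of pread64 request sizes in one extracted trace
  grep -o 'count: [0-9]*' q1-asc-pread.txt \
    | awk '{ sizes[$2]++ } END { for (s in sizes) print s, sizes[s] }' \
    | sort -n

In the samples above, Q1 ASC is dominated by 131072-byte requests (16
combined 8KB blocks), while the DESC traces are all single 8KB blocks,
which also lines up with the roughly 9x difference in the pread counts.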

FWIW I believe this is what Thomas Munro meant by [1]:

  You'll probably see a flood of uncombined 8KB IOs in the pg_aios
  view while travelling up the heap with cache misses today.

It wasn't quite this obvious in pg_aios, though. I've usually seen only a
single event there, so it was hard to draw any conclusions from that. The
trace makes it pretty obvious: we don't combine the I/O, and we also know
Linux does not do any readahead for backwards scans.
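
A quick-and-dirty way to catch more of the in-flight I/Os is to sample the
view in a loop from another session while the query runs, something like
this (I'm going from memory on the pg_aios column names, so treat
operation/length as an assumption):

  # repeatedly sample pg_aios while the query runs elsewhere
  while true; do
    psql -X -q -c "SELECT operation, length, count(*) FROM pg_aios GROUP BY 1, 2 ORDER BY 1, 2;"
    sleep 0.1
  done
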
regards
--
Tomas Vondra