Re: index prefetching

From: "Peter Geoghegan" <pg(at)bowt(dot)ie>
To: "Thomas Munro" <thomas(dot)munro(at)gmail(dot)com>
Cc: "Tomas Vondra" <tomas(at)vondra(dot)me>, "Andres Freund" <andres(at)anarazel(dot)de>, "Nazir Bilal Yavuz" <byavuz81(at)gmail(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Melanie Plageman" <melanieplageman(at)gmail(dot)com>, "PostgreSQL Hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Georgios" <gkokolatos(at)protonmail(dot)com>, "Konstantin Knizhnik" <knizhnik(at)garret(dot)ru>, "Dilip Kumar" <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-08-12 21:22:20
Message-ID: DC0RJDW1PEAX.NDGIXR6WPM94@bowt.ie
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue Aug 12, 2025 at 1:06 AM EDT, Thomas Munro wrote:
> I'd be interested to hear more about reverse scans. Bilal was
> speculating about backwards I/O combining in read_stream.c a while
> back, but we didn't have anything interesting to use it yet. You'll
> probably see a flood of uncombined 8KB IOs in the pg_aios view while
> travelling up the heap with cache misses today. I suspect Linux does
> reverse sequential prefetching with buffered I/O (less sure about
> other OSes) which should help but we'd still have more overheads than
> we could if we combined them, not to mention direct I/O.

Doesn't look like Linux will do this, if what my local testing shows is anything
to go on. I'm a bit surprised by this (I also thought that OS readahead on linux
was quite sophisticated).

There does seem to be something fishy going on with the patch here. I can see
strange inconsistencies in EXPLAIN ANALYZE output when the server is started
with --debug_io_direct=data with the master, compared to what I see with the
patch.

Test case
=========

My test case is a minor refinement of Tomas' backwards scan test case from
earlier today, though with one important difference: I ran
"alter index idx set (deduplicate_items = off); reindex index idx;" to get a
pristine index without any posting lists (since the unrelated issue with posting
list TIDs otherwise risks obscuring something relevant).

master
------

pg(at)regression:5432 [2390630]=# select pg_buffercache_evict_relation('t'); select pg_prewarm('idx');
***SNIP***
pg(at)regression:5432 [2390630]=# EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a;
┌────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├────────────────────────────────────────────────────────────────────────────────┤
│ Index Scan using idx on t (actual time=0.117..982.469 rows=1048576.00 loops=1) │
│ Index Cond: ((a >= 16336) AND (a <= 49103)) │
│ Index Searches: 1 │
│ Buffers: shared hit=10353 read=49933 │
│ I/O Timings: shared read=861.953 │
│ Planning: │
│ Buffers: shared hit=63 read=20 │
│ I/O Timings: shared read=1.898 │
│ Planning Time: 2.131 ms │
│ Execution Time: 1015.679 ms │
└────────────────────────────────────────────────────────────────────────────────┘
(10 rows)

pg(at)regression:5432 [2390630]=# select pg_buffercache_evict_relation('t'); select pg_prewarm('idx');
***SNIP***
pg(at)regression:5432 [2390630]=# EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a desc;
┌──────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────┤
│ Index Scan Backward using idx on t (actual time=7.919..6340.579 rows=1048576.00 loops=1) │
│ Index Cond: ((a >= 16336) AND (a <= 49103)) │
│ Index Searches: 1 │
│ Buffers: shared hit=10350 read=49933 │
│ I/O Timings: shared read=6219.776 │
│ Planning: │
│ Buffers: shared hit=5 │
│ Planning Time: 0.076 ms │
│ Execution Time: 6374.008 ms │
└──────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)

Notice that readahead seems to be effective with the forwards scan only (even
though I'm using debug_io_direct=data for this). Also notice that each query
shows identical "Buffers:" output -- that detail is exactly as expected.

Prefetch patch
--------------

Same pair of queries/prewarming/eviction steps with my working copy of the
prefetching patch:

pg(at)regression:5432 [2400564]=# select pg_buffercache_evict_relation('t'); select pg_prewarm('idx');
***SNIP***
pg(at)regression:5432 [2400564]=# EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a;
┌────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├────────────────────────────────────────────────────────────────────────────────┤
│ Index Scan using idx on t (actual time=0.136..298.301 rows=1048576.00 loops=1) │
│ Index Cond: ((a >= 16336) AND (a <= 49103)) │
│ Index Searches: 1 │
│ Buffers: shared hit=6619 read=49933 │
│ I/O Timings: shared read=45.313 │
│ Planning: │
│ Buffers: shared hit=63 read=20 │
│ I/O Timings: shared read=2.232 │
│ Planning Time: 2.634 ms │
│ Execution Time: 330.379 ms │
└────────────────────────────────────────────────────────────────────────────────┘
(10 rows)

pg(at)regression:5432 [2400564]=# select pg_buffercache_evict_relation('t'); select pg_prewarm('idx');
***SNIP***
pg(at)regression:5432 [2400564]=# EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a desc;
┌──────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────┤
│ Index Scan Backward using idx on t (actual time=7.926..1201.988 rows=1048576.00 loops=1) │
│ Index Cond: ((a >= 16336) AND (a <= 49103)) │
│ Index Searches: 1 │
│ Buffers: shared hit=10350 read=49933 │
│ I/O Timings: shared read=194.774 │
│ Planning: │
│ Buffers: shared hit=5 │
│ Planning Time: 0.097 ms │
│ Execution Time: 1236.655 ms │
└──────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)

It looks like the patch does significantly better with the forwards scan,
compared to the backwards scan (though both are improved by a lot). But that's
not the main thing about these results that I find interesting.

The really odd thing is that we get "shared hit=6619 read=49933" for the
forwards scan, and "shared hit=10350 read=49933" for the backwards scan. The
latter matches master (regardless of the scan direction used on master), while
the former just looks wrong. What explains the "missing buffer hits" seen with
the forwards scan?

Discrepancies
-------------

All 4 query executions agree that "rows=1048576.00", so the patch doesn't appear
to simply be broken/giving wrong answers. Might it be that the "Buffers"
instrumentation is broken?

The premise of my original complaint was that big inconsistencies in performance
shouldn't happen between similar forwards and backwards scans (at least not with
direct I/O). I now have serious doubts about that premise, since it looks like
OS readahead remains a big factor with direct I/O. Did I just miss something
obvious?

>> I wonder if today's commit b4212231 from Thomas ("Fix rare bug in
>> read_stream.c's split IO handling") fixed the issue, without anyone
>> realizing that the bug in question could manifest like this.
>
> I can't explain that. If you can consistently reproduce the change at
> the two base commits, maybe bisect?

Commit b4212231 was a wild guess on my part. Probably should have refrained
from that.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Burd 2025-08-12 21:42:35 Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock
Previous Message Arseniy Mukhin 2025-08-12 21:19:58 Re: amcheck support for BRIN indexes