Re: index prefetching

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-08-12 21:52:17
Message-ID: 51b5f71b-5f19-4453-91ff-2b9f2a840c58@vondra.me
Lists: pgsql-hackers


On 8/12/25 23:22, Peter Geoghegan wrote:
> ...
>
> It looks like the patch does significantly better with the forwards scan,
> compared to the backwards scan (though both are improved by a lot). But that's
> not the main thing about these results that I find interesting.
>
> The really odd thing is that we get "shared hit=6619 read=49933" for the
> forwards scan, and "shared hit=10350 read=49933" for the backwards scan. The
> latter matches master (regardless of the scan direction used on master), while
> the former just looks wrong. What explains the "missing buffer hits" seen with
> the forwards scan?
>
> Discrepancies
> -------------
>
> All 4 query executions agree that "rows=1048576.00", so the patch doesn't appear
> to simply be broken/giving wrong answers. Might it be that the "Buffers"
> instrumentation is broken?
>

I think a bug in the prefetch patch is more likely. I tried it with a
patch that adds various prefetch-related counters to EXPLAIN, and I see
this:

test=# EXPLAIN (ANALYZE, VERBOSE, COSTS OFF) SELECT * FROM t WHERE a
BETWEEN 16336 AND 49103 ORDER BY a;

QUERY PLAN
------------------------------------------------------------------------
Index Scan using idx on public.t (actual time=0.682..527.055
rows=1048576.00 loops=1)
Output: a, b
Index Cond: ((t.a >= 16336) AND (t.a <= 49103))
Index Searches: 1
Prefetch Distance: 271.263
Prefetch Count: 60888
Prefetch Stalls: 1
Prefetch Skips: 991211
Prefetch Resets: 3
Prefetch Histogram: [2,4) => 2, [4,8) => 8, [8,16) => 17, [16,32) =>
24, [32,64) => 34, [64,128) => 52, [128,256) => 82, [256,512) => 60669
Buffers: shared hit=5027 read=50872
I/O Timings: shared read=33.528
Planning:
Buffers: shared hit=78 read=23
I/O Timings: shared read=2.349
Planning Time: 3.686 ms
Execution Time: 559.659 ms
(17 rows)

test=# EXPLAIN (ANALYZE, VERBOSE, COSTS OFF) SELECT * FROM t WHERE a
BETWEEN 16336 AND 49103 ORDER BY a DESC;
QUERY PLAN
------------------------------------------------------------------------
Index Scan Backward using idx on public.t (actual time=1.110..4116.201
rows=1048576.00 loops=1)
Output: a, b
Index Cond: ((t.a >= 16336) AND (t.a <= 49103))
Index Searches: 1
Prefetch Distance: 271.061
Prefetch Count: 118806
Prefetch Stalls: 1
Prefetch Skips: 962515
Prefetch Resets: 3
Prefetch Histogram: [2,4) => 2, [4,8) => 7, [8,16) => 12, [16,32) =>
17, [32,64) => 24, [64,128) => 3, [128,256) => 4, [256,512) => 118737
Buffers: shared hit=30024 read=50872
I/O Timings: shared read=581.353
Planning:
Buffers: shared hit=82 read=23
I/O Timings: shared read=3.168
Planning Time: 4.289 ms
Execution Time: 4185.407 ms
(17 rows)

These two parts are interesting:

Prefetch Count: 60888
Prefetch Skips: 991211

Prefetch Count: 118806
Prefetch Skips: 962515

It looks like the backwards scan skips fewer blocks. The skipping is
based on the lastBlock optimization, i.e. looking for runs of the same
block number and not prefetching the duplicates. I don't quite see why
it would affect just the backwards scan, though. Seems weird.
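
FWIW there's a rough way to sanity-check how many "same block as the
previous one" cases each direction actually has, using the usual ctid
trick. This only approximates the order the index scan visits the heap
in (it assumes duplicates of "a" are read in TID order), so treat the
number as a ballpark figure:

  SELECT count(*) AS same_block_as_prev
    FROM (SELECT (ctid::text::point)[0]::bigint AS blkno,
                 lag((ctid::text::point)[0]::bigint)
                   OVER (ORDER BY a, ctid) AS prev_blkno  -- forward scan order
            FROM t
           WHERE a BETWEEN 16336 AND 49103) s
   WHERE blkno = prev_blkno;

The backwards scan should visit the same TIDs in the reverse order, so
the number of adjacent duplicates has to be the same in both directions,
and naively I'd expect the skip counters to be (nearly) the same too.
That makes the difference above even more suspicious.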

> The premise of my original complaint was that big inconsistencies in performance
> shouldn't happen between similar forwards and backwards scans (at least not with
> direct I/O). I now have serious doubts about that premise, since it looks like
> OS readahead remains a big factor with direct I/O. Did I just miss something
> obvious?
>

I don't think you missed anything. The assumption does seem to rely on
the OS handling the underlying I/O patterns equally in both directions,
and unfortunately that does not seem to be the case. Maybe we could
"invert" the data set, i.e. make it "descending" instead of "ascending"?
That would make the heap access direction "forward" again ...
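
Something like this, assuming t was originally filled with ascending
values of "a" (the names are just placeholders):

  -- build a copy with "a" descending in heap order
  CREATE TABLE t_desc AS SELECT * FROM t ORDER BY a DESC;
  CREATE INDEX t_desc_a_idx ON t_desc (a);
  VACUUM ANALYZE t_desc;

With that, the ORDER BY a DESC query should walk the heap mostly in the
"forward" direction, and the plain ORDER BY a query would walk it
backwards, so we could see whether the slowdown follows the index scan
direction or the heap access direction.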

regards

--
Tomas Vondra
