From: | Tomas Vondra <tomas(at)vondra(dot)me> |
---|---|
To: | Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Subject: | Re: index prefetching |
Date: | 2025-08-12 23:10:36 |
Message-ID: | b739c179-25ca-4371-b528-cf80ea111d94@vondra.me |
Lists: | pgsql-hackers |
On 8/12/25 23:52, Tomas Vondra wrote:
>
> On 8/12/25 23:22, Peter Geoghegan wrote:
>> ...
>>
>> It looks like the patch does significantly better with the forwards scan,
>> compared to the backwards scan (though both are improved by a lot). But that's
>> not the main thing about these results that I find interesting.
>>
>> The really odd thing is that we get "shared hit=6619 read=49933" for the
>> forwards scan, and "shared hit=10350 read=49933" for the backwards scan. The
>> latter matches master (regardless of the scan direction used on master), while
>> the former just looks wrong. What explains the "missing buffer hits" seen with
>> the forwards scan?
>>
>> Discrepancies
>> -------------
>>
>> All 4 query executions agree that "rows=1048576.00", so the patch doesn't appear
>> to simply be broken/giving wrong answers. Might it be that the "Buffers"
>> instrumentation is broken?
>>
>
> I think a bug in the prefetch patch is more likely. I tried with a patch
> that adds various prefetch-related counters to explain, and I see this:
>
>
> test=# EXPLAIN (ANALYZE, VERBOSE, COSTS OFF) SELECT * FROM t WHERE a
> BETWEEN 16336 AND 49103 ORDER BY a;
>
> QUERY PLAN
> ------------------------------------------------------------------------
> Index Scan using idx on public.t (actual time=0.682..527.055
> rows=1048576.00 loops=1)
> Output: a, b
> Index Cond: ((t.a >= 16336) AND (t.a <= 49103))
> Index Searches: 1
> Prefetch Distance: 271.263
> Prefetch Count: 60888
> Prefetch Stalls: 1
> Prefetch Skips: 991211
> Prefetch Resets: 3
> Prefetch Histogram: [2,4) => 2, [4,8) => 8, [8,16) => 17, [16,32) =>
> 24, [32,64) => 34, [64,128) => 52, [128,256) => 82, [256,512) => 60669
> Buffers: shared hit=5027 read=50872
> I/O Timings: shared read=33.528
> Planning:
> Buffers: shared hit=78 read=23
> I/O Timings: shared read=2.349
> Planning Time: 3.686 ms
> Execution Time: 559.659 ms
> (17 rows)
>
>
> test=# EXPLAIN (ANALYZE, VERBOSE, COSTS OFF) SELECT * FROM t WHERE a
> BETWEEN 16336 AND 49103 ORDER BY a DESC;
> QUERY PLAN
> ------------------------------------------------------------------------
> Index Scan Backward using idx on public.t (actual time=1.110..4116.201
> rows=1048576.00 loops=1)
> Output: a, b
> Index Cond: ((t.a >= 16336) AND (t.a <= 49103))
> Index Searches: 1
> Prefetch Distance: 271.061
> Prefetch Count: 118806
> Prefetch Stalls: 1
> Prefetch Skips: 962515
> Prefetch Resets: 3
> Prefetch Histogram: [2,4) => 2, [4,8) => 7, [8,16) => 12, [16,32) =>
> 17, [32,64) => 24, [64,128) => 3, [128,256) => 4, [256,512) => 118737
> Buffers: shared hit=30024 read=50872
> I/O Timings: shared read=581.353
> Planning:
> Buffers: shared hit=82 read=23
> I/O Timings: shared read=3.168
> Planning Time: 4.289 ms
> Execution Time: 4185.407 ms
> (17 rows)
>
> These two parts are interesting:
>
> Prefetch Count: 60888
> Prefetch Skips: 991211
>
> Prefetch Count: 118806
> Prefetch Skips: 962515
>
> It looks like the backwards scan skips fewer blocks. This is based on
> the lastBlock optimization, i.e. looking for runs of the same block
> number. I don't quite see why it would affect just the backwards scan,
> though. Seems weird.
>
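(As a rough sanity check of those counters - assuming each heap TID
results in exactly one prefetch-or-skip decision, which is how I read
them: 60888 + 991211 = 1052099 for the forward scan and 118806 + 962515
= 1081321 for the backwards scan, both in the ballpark of the 1048576
rows. So the totals are comparable, and the difference really is in how
many requests survive the lastBlock check.)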
Actually, this might be a consequence of how backwards scans work (at
least in btree). I logged the block numbers in
index_scan_stream_read_next, and this is what I see in the forward scan
(at the beginning):
index_scan_stream_read_next: block 24891
index_scan_stream_read_next: block 24892
index_scan_stream_read_next: block 24893
index_scan_stream_read_next: block 24892
index_scan_stream_read_next: block 24893
index_scan_stream_read_next: block 24894
index_scan_stream_read_next: block 24895
index_scan_stream_read_next: block 24896
index_scan_stream_read_next: block 24895
index_scan_stream_read_next: block 24896
index_scan_stream_read_next: block 24897
index_scan_stream_read_next: block 24898
index_scan_stream_read_next: block 24899
index_scan_stream_read_next: block 24900
index_scan_stream_read_next: block 24901
index_scan_stream_read_next: block 24902
index_scan_stream_read_next: block 24903
index_scan_stream_read_next: block 24904
index_scan_stream_read_next: block 24905
index_scan_stream_read_next: block 24906
index_scan_stream_read_next: block 24907
index_scan_stream_read_next: block 24908
index_scan_stream_read_next: block 24909
index_scan_stream_read_next: block 24910
while in the backwards scan (at the end) I see this:
index_scan_stream_read_next: block 24910
index_scan_stream_read_next: block 24911
index_scan_stream_read_next: block 24908
index_scan_stream_read_next: block 24909
index_scan_stream_read_next: block 24906
index_scan_stream_read_next: block 24907
index_scan_stream_read_next: block 24908
index_scan_stream_read_next: block 24905
index_scan_stream_read_next: block 24906
index_scan_stream_read_next: block 24903
index_scan_stream_read_next: block 24904
index_scan_stream_read_next: block 24905
index_scan_stream_read_next: block 24902
index_scan_stream_read_next: block 24903
index_scan_stream_read_next: block 24900
index_scan_stream_read_next: block 24901
index_scan_stream_read_next: block 24902
index_scan_stream_read_next: block 24899
index_scan_stream_read_next: block 24900
index_scan_stream_read_next: block 24897
index_scan_stream_read_next: block 24898
index_scan_stream_read_next: block 24899
index_scan_stream_read_next: block 24895
index_scan_stream_read_next: block 24896
index_scan_stream_read_next: block 24897
index_scan_stream_read_next: block 24894
index_scan_stream_read_next: block 24895
index_scan_stream_read_next: block 24896
index_scan_stream_read_next: block 24892
index_scan_stream_read_next: block 24893
index_scan_stream_read_next: block 24894
index_scan_stream_read_next: block 24891
index_scan_stream_read_next: block 24892
index_scan_stream_read_next: block 24893
These are only the blocks that ended up being passed to the read
stream, not the skipped ones. And you can immediately see the backwards
scan requests more blocks for (roughly) the same part of the scan - the
min/max blocks roughly match. Overall it passes almost twice as many
blocks to the read stream (Prefetch Count 118806 vs. 60888).
The reason is pretty simple - the table is very correlated, and the
forward scan requests blocks mostly in the right order. Only rarely does
it have to jump "back" when progressing to the next value, so the
lastBlock optimization works nicely.
But with the backwards scan we walk the values backwards, while the
blocks for each value are still accessed in the forward direction. So we
advance a couple of blocks "forward" and then jump to the preceding
value - which means a couple of blocks *back*. And that breaks the
lastBlock check.
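To make the lastBlock check concrete, here's a minimal sketch of the
logic (not the actual patch code - the struct, the callback signature
and the next_tid_block() helper are made up for illustration). The
callback only hands a block to the read stream when it differs from the
block it handed over immediately before, so it de-duplicates runs of
identical block numbers, but is defeated as soon as two requests for
the same block have a different block in between:

static BlockNumber
index_scan_stream_read_next_sketch(IndexPrefetchState *state)
{
    for (;;)
    {
        /* block of the next heap TID (hypothetical helper) */
        BlockNumber block = next_tid_block(state);

        if (block == InvalidBlockNumber)
            return InvalidBlockNumber;   /* end of scan */

        /* same block as the previous request -> "Prefetch Skips" */
        if (block == state->lastBlock)
            continue;

        state->lastBlock = block;
        return block;                    /* passed on -> "Prefetch Count" */
    }
}

With a correlated table and a forward scan the TIDs arrive in mostly
ascending block order, so repeated blocks tend to be adjacent and get
skipped. With the backwards scan the forward-then-back jumps mean the
same block often comes back only after a different one, so the check
doesn't catch it and the block gets requested (and counted as a buffer
access) again.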
I believe this applies both to master and the prefetching patch, except
that master doesn't have the read stream - it only does sync I/O. Could
that hide the extra buffer accesses, somehow?
Anyway, this access pattern in backwards scans seems a bit unfortunate.
regards
--
Tomas Vondra