Re: index prefetching

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-08-13 16:01:52
Message-ID: CAH2-Wz=L7h-koDKa3_NEg39Faw7MrOkSVOsodvQ4toSQahvWjQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 13, 2025 at 11:28 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > With "sync" I always get this (after a restart):
> >
> > Buffers: shared hit=7435 read=52801
> >
> > while with "worker" I get this:
> >
> > Buffers: shared hit=4879 read=52801
> > Buffers: shared hit=5151 read=52801
> > Buffers: shared hit=4978 read=52801
> >
> > So not only does it change run to run, it also does not add up to 60236.
>
> This is reproducible on master? If so, how?

AFAIK it is *not* reproducible on master.

> If this happens only with the prefetching patch applied, is it possible that
> what happens here is that we occasionally re-request buffers that are already in
> the process of being read in? That would only happen with a read stream and
> io_method != sync (since with sync we won't read ahead). If we have to start
> reading in a buffer that's already undergoing IO we wait for the IO to
> complete and count that access as a hit:

This theory seems quite plausible to me, though it is a bit
surprising that I see incorrect buffer hit counts in the "good"
forwards scan case, rather than in the "bad" backwards scan case.

Here's what I mean by things being broken on the read stream side (at
least with certain backwards scan cases):

When I add instrumentation to the read stream side -- elog debug
calls that show the blocknum seen by read_stream_get_block -- I see
out-of-order and repeated blocknums with the "bad" backwards scan
case ("SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a
desc").

...
NOTICE: index_scan_stream_read_next: index 1163 TID (25052,21)
WARNING: prior lastBlock is 25053 for batchno 2856, new one: 25052
WARNING: blocknum: 25052, 0x55614810efb0
WARNING: blocknum: 25052, 0x55614810efb0
NOTICE: index_scan_stream_read_next: index 1161 TID (25053,3)
WARNING: prior lastBlock is 25052 for batchno 2856, new one: 25053
WARNING: blocknum: 25053, 0x55614810efb0
NOTICE: index_scan_stream_read_next: index 1160 TID (25052,19)
WARNING: prior lastBlock is 25053 for batchno 2856, new one: 25052
WARNING: blocknum: 25052, 0x55614810efb0
WARNING: blocknum: 25052, 0x55614810efb0
NOTICE: index_scan_stream_read_next: index 1141 TID (25051,21)
WARNING: prior lastBlock is 25052 for batchno 2856, new one: 25051
WARNING: blocknum: 25051, 0x55614810efb0
...

Notice that we see the same blocknum twice in immediate succession.
Also notice that we're passed 25052 and then subsequently passed
25053, only to be passed 25052 once more.

OTOH, when I run the equivalent "good" forwards scan ("SELECT * FROM
t WHERE a BETWEEN 16336 AND 49103 ORDER BY a"), the output looks just
about perfect. I have to look around quite a bit longer before I can
find a repeated blocknum across successive read_stream_get_block calls:

...
NOTICE: index_scan_stream_read_next: index 303 TID (74783,1)
WARNING: prior lastBlock is 74782 for batchno 2862, new one: 74783
WARNING: blocknum: 74783, 0x55614810efb0
NOTICE: index_scan_stream_read_next: index 323 TID (74784,1)
WARNING: prior lastBlock is 74783 for batchno 2862, new one: 74784
WARNING: blocknum: 74784, 0x55614810efb0
NOTICE: index_scan_stream_read_next: index 324 TID (74783,21)
WARNING: prior lastBlock is 74784 for batchno 2862, new one: 74783
WARNING: blocknum: 74783, 0x55614810efb0
NOTICE: index_scan_stream_read_next: index 325 TID (74784,2)
WARNING: prior lastBlock is 74783 for batchno 2862, new one: 74784
WARNING: blocknum: 74784, 0x55614810efb0
...

These out-of-order repeat requests are much rarer. And I *never* see
identical requests in *immediate* succession, whereas those are common
with the backwards scan case.

I believe that the out-of-order repeat requests shown here are a
legitimate consequence of the TIDs being slightly out of order in
relatively few places (so the forwards scan case may well already be
behaving exactly as I expect):

pg(at)regression:5432 [2470184]=# select ctid, a from t where ctid
between '(74783,1)' and '(74784,1)';
┌────────────┬────────┐
│    ctid    │   a    │
├────────────┼────────┤
│ (74783,1)  │ 49,077 │
│ (74783,2)  │ 49,077 │
│ (74783,3)  │ 49,077 │
│ (74783,4)  │ 49,077 │
│ (74783,5)  │ 49,077 │
│ (74783,6)  │ 49,077 │
│ (74783,7)  │ 49,077 │
│ (74783,8)  │ 49,077 │
│ (74783,9)  │ 49,077 │
│ (74783,10) │ 49,077 │
│ (74783,11) │ 49,077 │
│ (74783,12) │ 49,077 │
│ (74783,13) │ 49,077 │
│ (74783,14) │ 49,077 │
│ (74783,15) │ 49,077 │
│ (74783,16) │ 49,077 │
│ (74783,17) │ 49,077 │
│ (74783,18) │ 49,077 │
│ (74783,19) │ 49,077 │
│ (74783,20) │ 49,077 │
│ (74783,21) │ 49,078 │
│ (74784,1)  │ 49,077 │
└────────────┴────────┘
(22 rows)

Bear in mind that EXPLAIN ANALYZE shows *identical* "Buffers:" details
for each query on master. So I believe that I am completely justified
in expecting the calls to read_stream_get_block for the backwards scan
to use identical blocknums to the ones for the equivalent/good
forwards scan (except that they should be in the exact opposite
order). And yet that's not what I see.
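To make that expectation concrete: if you extract the blocknums from
the "blocknum: N" WARNING lines of a forwards run and a backwards run
(one number per line, into two files), the second file should simply
be the first file reversed. A throwaway checker along these lines --
purely hypothetical, just to illustrate the invariant I expect to
hold -- would report any position where that breaks down:

#include <stdio.h>
#include <stdlib.h>

/* Read one blocknum per line from "path" into a heap-allocated array. */
static long *
read_blocknums(const char *path, size_t *count)
{
    FILE       *fp = fopen(path, "r");
    long       *vals = NULL;
    size_t      n = 0;
    size_t      cap = 0;
    long        v;

    if (fp == NULL)
    {
        perror(path);
        exit(1);
    }
    while (fscanf(fp, "%ld", &v) == 1)
    {
        if (n == cap)
        {
            cap = cap ? cap * 2 : 1024;
            vals = realloc(vals, cap * sizeof(long));
        }
        vals[n++] = v;
    }
    fclose(fp);
    *count = n;
    return vals;
}

int
main(int argc, char **argv)
{
    size_t      nfwd;
    size_t      nbwd;
    long       *fwd;
    long       *bwd;

    if (argc != 3)
    {
        fprintf(stderr, "usage: %s forwards.txt backwards.txt\n", argv[0]);
        return 1;
    }
    fwd = read_blocknums(argv[1], &nfwd);
    bwd = read_blocknums(argv[2], &nbwd);

    if (nfwd != nbwd)
    {
        printf("request counts differ: %zu vs %zu\n", nfwd, nbwd);
        return 1;
    }
    for (size_t i = 0; i < nfwd; i++)
    {
        if (fwd[i] != bwd[nfwd - 1 - i])
        {
            printf("mismatch at position %zu: %ld vs %ld\n",
                   i, fwd[i], bwd[nfwd - 1 - i]);
            return 1;
        }
    }
    printf("backwards trace is the exact reverse of the forwards trace\n");
    return 0;
}

(Run as, say, "./check_reversal forwards.blocknums backwards.blocknums";
the file names are just placeholders.)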

Maybe this has something to do with the read position and the stream
position becoming mixed up? I find it odd that the relevant read
stream callback, index_scan_stream_read_next, says "If the stream
position is undefined, just use the read position". That's just a
guess, though. This issue is tricky to debug. I'm not yet used to
debugging problems like these (though I'll probably become an expert
at it in the months ahead).

--
Peter Geoghegan
