From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-08-13 12:15:37
Message-ID: 6cb6109d-71d6-490c-8056-d8885081b008@vondra.me
Lists: pgsql-hackers
On 8/13/25 01:33, Peter Geoghegan wrote:
> On Tue, Aug 12, 2025 at 7:10 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> Actually, this might be a consequence of how backwards scans work (at
>> least in btree). I logged the block in index_scan_stream_read_next, and
>> this is what I see in the forward scan (at the beginning):
>
> Just to be clear: you did disable deduplication and then reindex,
> right? You're accounting for the known issue with posting list TIDs
> returning TIDs in the wrong order, relative to the scan direction
> (when the scan direction is backwards)?
>
> It won't be necessary to do this once I commit my patch that fixes the
> issue directly, on the nbtree side, but for now deduplication messes
> things up here. And so for now you have to work around it.
>
No, I forgot about that (and the patch only applies to master).
>> But with the backwards scan we apparently scan the values backwards, but
>> then the blocks for each value are accessed in forward direction. So we
>> do a couple blocks "forward" and then jump to the preceding value - but
>> that's a couple blocks *back*. And that breaks the lastBlock check.
>
> I don't think that this should be happening. The read stream ought to
> be seeing blocks in exactly the same order as everything else.
>
>> I believe this applies both to master and the prefetching, except that
>> master doesn't have read stream - so it only does sync I/O.
>
> In what sense is it an issue on master?
>
> On master, we simply access the TIDs in whatever order amgettuple
> returns TIDs in. That should always be scan order/index key space
> order, where heap TID counts as a tie-breaker/affects the key space in
> the presence of duplicates (at least once that issue with posting
> lists is fixed, or once deduplication has been disabled in a way that
> leaves no posting list TIDs around via a reindex).
>
> It is certainly not surprising that master does poorly on backwards
> scans. And it isn't all that surprising that master does worse on
> backwards scans when direct I/O is in use (per the explanation
> Andres offered just now). But master should nevertheless always read
> the TIDs in whatever order it gets them from amgettuple in.
>
> It sounds like amgetbatch doesn't really behave analogously to master
> here, at least with backwards scans. It sounds like you're saying that
> we *won't* feed TIDs heap block numbers to the read stream in exactly
> scan order (when we happen to be scanning backwards) -- which seems
> wrong to me.
>
> As you pointed out, a forwards scan of a DESC column index should feed
> heap blocks to the read stream in a way that is very similar to an
> equivalent backwards scan of a similar ASC column on the same table.
> There might be some very minor differences, due to differences in the
> precise leaf page boundaries among each of the indexes. But that
> should hardly be noticeable at all.
>
I gave this another try, this time with deduplication disabled, and on
master I also applied the patch (though now I realize that's probably
unnecessary, right?).
I did a couple more things for this experiment:
1) created a second table with an "inverse pattern" that's decreasing:
create table t2 (like t) with (fillfactor = 20);
insert into t2 select -a, b from t;
create index idx2 on t2 (a);
alter index idx2 set (deduplicate_items = false);
reindex index idx2;
The idea is that

  SELECT * FROM t WHERE (a BETWEEN x AND y) ORDER BY a ASC

produces the same "block pattern" as

  SELECT * FROM t2 WHERE (a BETWEEN -y AND -x) ORDER BY a DESC
2) added logging to heapam_index_fetch_tuple
elog(LOG, "heapam_index_fetch_tuple block %u",
ItemPointerGetBlockNumber(tid));
3) disabled autovacuum (so that it doesn't trigger any logs)
4) a python script that processes the block numbers and counts the
number of blocks, runs, and forward/backward advances
5) a bash script that runs 4 "equivalent" queries on t/t2, with ASC/DESC
And the results look like this (FWIW this is with io_method=sync):
Q1: SELECT * FROM t WHERE a BETWEEN 16336 AND 49103
Q2: SELECT * FROM t2 WHERE a BETWEEN -49103 AND -16336
master / buffered
query order time blocks runs forward backward
---------------------------------------------------------------
Q1 ASC 575 1048576 57365 53648 3716
Q1 DESC 10245 1048576 57365 3716 53648
Q2 ASC 14819 1048576 86061 53293 32767
Q2 DESC 1063 1048576 86061 32767 53293
prefetch / buffered
query order time blocks runs forward backward
---------------------------------------------------------------
Q1 ASC 701 1048576 57365 53648 3716
Q1 DESC 1805 1048576 57365 3716 53648
Q2 ASC 1221 1048576 86061 53293 32767
Q2 DESC 2101 1048576 86061 32767 53293
master / direct
query order time blocks runs forward backward
---------------------------------------------------------------
Q1 ASC 6101 1048576 57365 53648 3716
Q1 DESC 12041 1048576 57365 3716 53648
Q2 ASC 14837 1048576 86061 53293 32767
Q2 DESC 14690 1048576 86061 32767 53293
prefetch / direct
query order time blocks runs forward backward
---------------------------------------------------------------
Q1 ASC 1504 1048576 57365 53648 3716
Q1 DESC 9034 1048576 57365 3716 53648
Q2 ASC 6988 1048576 86061 53293 32767
Q2 DESC 8959 1048576 86061 32767 53293
The timings are from runs without the extra logging, but there's still
quite a bit of run-to-run variation. The differences are nevertheless
fairly stable.
Some observations:
* The block stats are perfectly stable (for each query), both within
each build and between builds, and also perfectly symmetrical between
the ASC/DESC versions of each query. The ASC variant does the same
number of "forward" steps as the DESC variant does "backward" steps.
* There's a clear difference between Q1 and Q2, with Q2 having many more
runs (and a less "nice" mix of forward/backward steps). When I created
the t2 data set, I expected Q1 ASC to behave the same as Q2 DESC, but it
doesn't seem to work that way. Clearly, the "descending" pattern in t2
breaks the sequence of block numbers into many more runs.
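To illustrate that last point, here's a toy model (the block layout and numbers are made up for illustration, not derived from the actual tables) of why the reversed insertion order in t2 fragments the block sequence into one run per key group:

```python
def count_runs(blocks):
    """Number of maximal stretches of ascending (near-)consecutive
    block numbers (difference of 0 or +1)."""
    runs = 0
    prev = None
    for b in blocks:
        if prev is None or not (0 <= b - prev <= 1):
            runs += 1
        prev = b
    return runs

# Assume each key value occupies `per` consecutive heap blocks
# (low fillfactor, lots of duplicates).
per, nkeys = 4, 100

# t: keys inserted in ascending order, so an ASC index scan visits the
# heap blocks as one long ascending sequence.
t_blocks = [g * per + i for g in range(nkeys) for i in range(per)]

# t2: keys negated before insertion, so an ASC index scan visits the key
# groups in reverse heap order, while TIDs *within* each group still
# ascend (heap TID is the tie-breaker).
t2_blocks = [g * per + i for g in reversed(range(nkeys)) for i in range(per)]

print(count_runs(t_blocks))   # 1
print(count_runs(t2_blocks))  # 100 (one run per key group)
```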
>> Could that hide the extra buffer accesses, somehow?
>
> I think that you meant to ask about *missing* buffer hits with the
> patch, for the forwards scan. That doesn't agree with the backwards
> scan with the patch, nor does it agree with master (with either the
> forwards or backwards scan). Note that the heap accesses themselves
> appear to have sane/consistent numbers, since we always see
> "read=49933" as expected for those, for all 4 query executions that I
> showed.
>
> The "missing buffer hits" issue seems like an issue with the
> instrumentation itself. Possibly one that is totally unrelated to
> everything else we're discussing.
>
Yes, I came to the same conclusion. The fact that the stats presented
above are exactly the same for all the different cases (for each query)
is a sign it's about the tracking.
In fact, I believe this is about io_method. I initially didn't see the
difference you described, but then I realized I had set io_method=sync
to make it easier to track the block accesses. And if I change io_method
to worker, I get different stats, which also change between runs.
With "sync" I always get this (after a restart):
Buffers: shared hit=7435 read=52801
while with "worker" I get this:
Buffers: shared hit=4879 read=52801
Buffers: shared hit=5151 read=52801
Buffers: shared hit=4978 read=52801
So not only does it change from run to run, it also does not add up to 60236.
I vaguely recall running into this some time ago during AIO
benchmarking, and IIRC it's due to how StartReadBuffersImpl() may behave
differently depending on I/O started earlier. It only calls
PinBufferForBlock() in some cases, and PinBufferForBlock() is what
updates the hit count.
In any case, it seems to depend on io_method, and it's confusing.
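To make that concrete, here's a toy model (pure illustration, not PostgreSQL code) of this kind of accounting: an access that lands on a block whose read was started earlier but hasn't completed yet is absorbed into the in-flight I/O and counted as neither a hit nor a read, so the hit count varies with I/O timing and hits + reads no longer adds up to the total number of accesses.

```python
def count_io(accesses, inflight_window):
    """Toy model: a "hit" is counted only when the block is already
    valid in the cache. A block whose read has been started but not yet
    completed is absorbed into the in-flight I/O and counted as neither
    a hit nor a read. inflight_window simulates for how many subsequent
    accesses an async read stays in flight (0 = synchronous I/O).
    """
    hits = reads = 0
    cache = set()   # blocks whose reads have completed
    inflight = {}   # block -> accesses remaining until completion
    for b in accesses:
        # complete any I/Os whose countdown has expired
        for blk in [blk for blk, t in inflight.items() if t <= 0]:
            cache.add(blk)
            del inflight[blk]
        for blk in inflight:
            inflight[blk] -= 1
        if b in cache:
            hits += 1   # analogous to the hit counted when pinning
        elif b in inflight:
            pass        # covered by an earlier-started I/O: not counted
        else:
            reads += 1
            inflight[b] = inflight_window
    return hits, reads

# With synchronous I/O (window 0) every access is either a hit or a
# read; with I/O still in flight, some accesses are counted as neither.
print(count_io([1, 1, 2, 1, 2, 2], 0))  # (4, 2) -> adds up to 6 accesses
print(count_io([1, 1, 2, 1, 2, 2], 3))  # (0, 2) -> does not add up
```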
regards
--
Tomas Vondra
Attachments:
stats.py (text/x-python, 480 bytes)
explain-analyze.sh (application/x-shellscript, 793 bytes)