Re: index prefetching

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2026-02-09 22:44:11
Message-ID: CAH2-Wznv9_KGqHQ1vCW2pkiA6QskBGcx5NC_-UXnD6GEQasvAQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jan 30, 2026 at 7:18 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v9.

Attached is v10. There are 2 major areas of improvement in this latest revision:

1. We have enhanced the read stream callback (heapam_getnext_stream,
which is added by
v10-0005-Add-prefetching-to-index-scans-using-batch-inter.patch),
making it yield at key intervals. When we yield, we temporarily
suspend prefetching -- but only for long enough to give the scan the
opportunity to return one more matching tuple (don't confuse yielding
with pausing; we do both, but the goals are rather different in each
case).

Yielding like this keeps the scan responsive during prefetching: the
scan should never go too long without returning at least one tuple
(except when that just isn't possible at all). Testing has shown that
keeping the scan responsive in this sense is particularly important
during scans that appear in "ORDER BY ... LIMIT N" queries, as well as
during scans that feed into certain merge joins. IOW, it is
particularly important that we "keep the scan responsive" whenever it
has the potential to allow the scan to shut down before it has
performed work that turns out to be unnecessary (though it also seems
to have some benefits even when that isn't the case).

There is a complex trade-off here: we want to yield when we expect
some benefit from doing so. But we don't want to yield when doing so
risks compromising the read stream's ability to maintain an adequate
prefetch distance. There are comments in heapam_getnext_stream that
describe the theory in more detail. There are heuristics that were
derived using adversarial benchmarking/stress-testing. I'm sure that
they need more work, but this does seem like roughly the right idea.
Note that we now test whether the scan's read stream is using its
fast-path mode (read stream uses this when the scan reads heap pages
that are all cached).

2. A new patch (v10-0007-Limit-get_actual_variable_range-to-scan-one-inde.patch)
compensates for the fact that the main prefetching commit removes
get_actual_variable_range's VISITED_PAGES_LIMIT mechanism. Since
VISITED_PAGES_LIMIT cannot easily be ported over to the new table AM
interface selfuncs.c now uses.

I described the problem that we need to address when I posted v9:

> selfuncs.c problem
> ------------------
>
> Also worth noting that we recently found a problem with selfuncs.c:
> the VISITED_PAGES_LIMIT stuff in selfuncs.c is broken right now. v9
> tears that code out, pending adding back a real fix (earlier versions
> of the patch had VISITED_PAGES_LIMIT, but it didn't work).
>
> The underlying problem is that the existing VISITED_PAGES_LIMIT design
> is incompatible with our new table_index_getnext_slot interface. The
> new interface doesn't stop scanning until it is able to at least
> return 1 tuple. But VISITED_PAGES_LIMIT was invented precisely because
> get_actual_variable_endpoint's index scans took far too long, even
> though they're only ever required to locate 1 tuple. So that just
> won't work.
>
> We'll need to invent some kind of API that directly acknowledges the
> needs of the selfuncs.c caller, and others like it. Doing it in an
> ad-hoc way just doesn't seem possible anymore. That will have to wait
> for the next revision, though (or the one after that).

The new patch deals with the problem in a completely different way,
and at a completely different layer: it adds a new
IndexScanDescData.xs_read_extremal_only field, set only by
get_actual_variable_range. When nbtree sees that the field has been
set, it gives up after scanning only one leaf page (the page that
contains extremal values that are of interest to
get_actual_variable_range). Note that we completely stop caring about
heap page fetches with this new approach.

There are good reasons to believe that the new
IndexScanDescData.xs_read_extremal_only approach will solve existing
problems with VISITED_PAGES_LIMIT. Recent benchmarking from Mark
Callaghan [1] (which I've independently recreated with my own minimal
test suite) shows that VISITED_PAGES_LIMIT becomes completely
ineffective, once we reach a tipping point where many index tuples at
one end of the index all have their LP_DEAD bit set.

I'm going to start a new thread to discuss the issues in this area
later today. I'm aiming to fix an existing, independent issue in this
new patch, so it makes sense to discuss it on a completely separate
thread.

[1] https://smalldatum.blogspot.com/2026/01/cpu-bound-insert-benchmark-vs-postgres.html
--
Peter Geoghegan

Attachment Content-Type Size
v10-0011-Make-hash-index-AM-use-amgetbatch-interface.patch application/x-patch 39.6 KB
v10-0008-Make-buffer-hit-helper.patch application/x-patch 5.8 KB
v10-0001-Extract-fake-LSN-infrastructure-from-GiST-index-.patch application/x-patch 16.9 KB
v10-0009-Don-t-wait-for-already-in-progress-IO.patch application/x-patch 20.6 KB
v10-0010-Add-fake-LSN-support-to-hash-index-AM.patch application/x-patch 13.8 KB
v10-0005-Add-prefetching-to-index-scans-using-batch-inter.patch application/x-patch 70.0 KB
v10-0007-Limit-get_actual_variable_range-to-scan-one-inde.patch application/x-patch 8.7 KB
v10-0004-Introduce-read_stream_-pause-resume-yield.patch application/x-patch 11.8 KB
v10-0006-Use-ExecSetTupleBound-hint-during-index-scans.patch application/x-patch 9.9 KB
v10-0003-Add-batching-interfaces-used-by-heapam-and-nbtre.patch application/x-patch 200.4 KB
v10-0002-Use-fake-LSNs-to-improve-nbtree-dropPin-behavior.patch application/x-patch 15.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2026-02-09 23:14:27 Problems with get_actual_variable_range's VISITED_PAGES_LIMIT
Previous Message Matheus Alcantara 2026-02-09 22:39:43 Re: Add CREATE SCHEMA ... LIKE support