Re: index prefetching

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Alexandre Felipe <o(dot)alexandre(dot)felipe(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2026-06-22 20:48:10
Message-ID: CAH2-WzkZTkDuyVFszLwPJesF9pS5E8m0UA+344bx-B-zfA5kaw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Jun 14, 2026 at 2:54 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v27, which cleans up some remaining loose ends.

V28 is attached.

Executive summary:

v28 includes better "start read stream" heuristics, adds SP-GiST
support for amgetbatch, improves the heuristics that bound the work
that get_actual_variable_range performs with a severely bloated index
during planning, and adds more extensive testing of tricky nbtree
edge-cases.

Full details:

* v28 replaces the "only start prefetching on second batch" heuristic
with a more principled approach based on the number of heap blocks
read so far. We now create a read stream the fourth time the scan
switches to another heap block. This cutoff was derived empirically,
by following a process driven by my microbenchmark suite.

This is the standout item in v28. Obviously, the old heuristic left
money on the table regarding prefetching benefits [1] -- that is the
main reason to replace it with something better. Certain
somewhat-selective queries are now benefit much more from prefetching,
without it regressing anything else. But the new heuristic has 2
additional nonobvious advantages:

1. Typically, the new heuristic will start prefetching much earlier
than the old one. But sometimes the opposite occurs: prefetching takes
*longer* to begin. That also helps performance (because we weigh the
actual cost we care about, not a noisy proxy for it).

This helps with some of my adversarial test queries, which still had
regressions before v28. I'm referring to adversarial queries that
perform lateral joins with a LIMIT on the inner side of a nestloop
join (e.g., the A8 query that some of us discussed over IM). Now there
are no regressions at all in my main microbenchmark test suite, which
was an unexpected bonus.

We now seem to avoid all regressions related to LIMITs, lateral joins,
nestloop antijoins, and nestloop semijoins -- without needing to pass
information from the planner down to the scan to do so.

2. The rules determining when to create a read stream are now exactly
the same for index-only scans -- no special case logic is required
anymore. This is simpler and more elegant.

The old heuristic was actually: "Start prefetching on the second
batch, except when there have been exactly zero heap fetches so far
(possible only during an inde-only scan)". The new heuristic only
considers heap fetches.

* A new v28 patch adds amgetbatch support to SP-GiST, the last in-core
index AM that still used the amgettuple interface. Performance
improves to a degree very similar to other index AMs: range scan
queries can be 20x to 25x faster.

An additional restriction applies to index-only scans, which is a
separate issue peculiar to SP-GiST: index-only scans are now disabled
for "long values" opclasses such as the text radix opclass. Unlike the
similar IoS issue in GiST + SP-GiST (the ordered scans issue), this is
a legitimate limitation in the amgetbatch design itself (there is no
existing bug involved here). Don't confuse these two separate
IoS-related issues.

Here's why I just desupported index-only scans for this particular
subset of SP-GiST opclasses: using "long values" doesn't fit well with
our resource management strategy during amgetbatch scans. Our usual
approach of reconstructing a value runs into the problem that the
required metadata size is essentially unbounded. The prefix cannot
reliably fit into a fixed per-batch reconstruction workspace; we can't
use a generic conservative estimate known at the start of the scan.
Note that some prefix style opclasses are not affected at all -- those
that already promise to keep their prefix size under BLCKSZ (our usual
amgettransform approach can work there).

I think we should just accept this limitation to gain the benefit of
prefetching during SP-GiST scans. I'm loathe to invent complicated new
heapam infrastructure just to deal with this problem. I'd rather just
not support SP-GiST at all.

The SP-GiST patch is still very much WIP. I changed how SP-GiST VACUUM
uses its own read stream (for prefetching on index pages) to avoid
obtaining self-conflicting cleanup locks, which is another aspect of
SP-GiST that presented a unique challenge. See the
SPGIST_VACUUM_DRAIN_INTERVAL mechanism.

* v28 includes a new patch that augments the existing
VISITED_PAGES_LIMIT mechanism (which get_actual_variable_range uses to
limit the amount of work its scan performs in pathological cases) with
a new INDEX_PAGES_LIMIT limit. This fixes a complaint from Mark
Callaghan [2] about planning time for range queries becoming excessive
with queue-like tables (because VISITED_PAGES_LIMIT alone was
ineffective).

Note that this is now independent work; it really has nothing to do
with amgetbatch support per se. I'm including it now, because the work
in this area is an offshoot of the new slot-based index scan design,
and was previously discussed on this thread.

Background: Earlier versions of this patch set included a roughly
comparable patch, which I abandoned towards the end of the PG 19
cycle. My earlier proposal wholly replaced VISITED_PAGES_LIMIT; this
new patch complements VISITED_PAGES_LIMIT, by specifically targeting
its one major weakness. That is, this new INDEX_PAGES_LIMIT mechanism
will only tally index leaf page reads that return zero matching items
to the table AM (i.e., those that don't return any batch). And so
INDEX_PAGES_LIMIT is complementary to VISITED_PAGES_LIMIT; it doesn't
replace it.

The results from one microbenchmark indicate that the pathological
case is fully fixed (times given are for planning time, in
milliseconds):

1. Baseline (no dead tuples):
Metric Master Patch Ratio
---------- ------------ ------------ ----------
Avg 0.140 0.129 0.916x
p95 0.213 0.188 0.882x
p99 0.259 0.224 0.867x

3. Main (with bulk-deleted tuples, concurrent inserts/deletes):
Metric Master Patch Ratio
---------- ------------ ------------ ----------
Avg 3.017 0.259 0.086x
p95 10.055 0.230 0.023x
p99 28.034 2.121 0.076x

It makes sense that planning time can double in extreme cases due to
concentrated index bloat (relative to the case with no bloat). It does
not make sense for it to increase by 20x or more. The important thing
is that the worst case is some fixed small-ish multiple of the
best/average case -- there's absolutely no sensible reason to expect
get_actual_variable_range to not at least achieve that.

* v28 also adds a new "read stream throttling" mechanism to prevent
calls to the read stream callback during index-only scans that return
many batches with no prefetchable heap blocks (because every batch
consists of matching items that are all-visible). This is also used
during plain index scans, though that is much less critical.

Throttling works by pausing the read stream when we detect that
amgetbatch calls aren't producing any useful heap block numbers for
the read stream to consume. It makes no sense for a single call to the
read stream callback to invoke amgetbatch *several* times without
returning even one usable heap block number to the read stream. As a
bonus, this adds test coverage for the pausing mechanism.

I developed an adversarial microbenchmark for this. v27 of the patch
performed pretty poorly here:

=== ios / cached ===
build / setting buffers (h+r) bufs vs master heap_fetch
io_read(ms) exec(ms) exec x
----------------------------------------------------------------------------------------------
master (no prefetch) 9 identical 5
- 0.017 1.00x
rc prefetch=off 9 identical 5
- 0.013 0.76x
rc prefetch=on 72 +63 5
- 0.276 16.24x

v28 leaves the rc prefetch=on case ~1.5x slower than
master/prefetch=off (not shown in the table). This seems quite
acceptable for an unrealistic LIMIT N microbenchmark such as this
(prefetching begins exactly where there'll be no more heap fetches,
which is inherently impossible to handle without any new overhead).

* v28 adds tests that cover nbtree edge cases involving array keys:
one test relies critically on a call to nbtree's btposreset for
mark/restore, another covers the case where btposreset is critical to
correctly backing up a cursor across a batch boundary for a scan with
array keys.

At one point Andres asked about adding such test coverage. It was
tricky to do it in a way that didn't require lots of data, but I found
a way to do it that keeps the added test cycles at an acceptable
level.

> There's also a new isolation test (dirty_snapshot.spec) which
> illustrates the role of dirty snapshots in constraint enforcement.

* I removed this in v28 because it didn't seem to be pulling its
weight: too many added test cycles for no actual new test coverage.

> > * There have been recent CI failures on "linux-autoconf", that I
> > haven't debugged.
> >
> > 001_aio.pl seems to reliably report "Dubious, test returned 4 (wstat
> > 1024, 0x400)" on this CI target. I fully expect this v26 to fail when
> > CFTester runs it through CI.

I suspect that this is just a bug in one of the tests added to
001_aio.pl. I can recreate the failure locally with a
-DRELCACHE_FORCE_RELEASE build.

For now I have disabled that one test case, without removing it, just
so CI passes.

[1] https://postgr.es/m/383515d3-1122-47fd-b6ea-09152500496a@garret.ru
[2] https://postgr.es/m/CAH2-Wzkt1WkKp4VRJu3qHfmKXc8W+XYv1RXg5d2d3fSvAeO=rg@mail.gmail.com
--
Peter Geoghegan

Attachment Content-Type Size
v28-0011-WIP-aio-bufmgr-Fix-race-condition-leading-to-dea.patch application/octet-stream 3.1 KB
v28-0009-Allow-read_stream_reset-to-not-wait-for-IO-compl.patch application/octet-stream 20.7 KB
v28-0010-aio-Fix-pgaio_io_wait-for-staged-IOs-B.patch application/octet-stream 6.3 KB
v28-0001-Add-slot-based-table-AM-index-scan-interface.patch application/octet-stream 121.4 KB
v28-0008-heapam-Add-index-scan-I-O-prefetching.patch application/octet-stream 55.1 KB
v28-0005-WIP-Adopt-amgetbatch-interface-in-GiST-index-AM.patch application/octet-stream 110.9 KB
v28-0006-WIP-Adopt-amgetbatch-interface-in-SP-GiST-index-.patch application/octet-stream 68.4 KB
v28-0007-heapam-Optimize-pin-transfers-during-index-scans.patch application/octet-stream 6.2 KB
v28-0004-Adopt-amgetbatch-interface-in-hash-index-AM.patch application/octet-stream 47.4 KB
v28-0003-Limit-get_actual_variable_range-leaf-page-reads.patch application/octet-stream 7.6 KB
v28-0002-Add-amgetbatch-interface-and-adopt-it-in-nbtree.patch application/octet-stream 268.7 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Jacob Champion 2026-06-22 20:50:37 Re: PG20 Minimum Dependency Thread
Previous Message Corey Huinker 2026-06-22 20:40:16 Re: postgres_fdw: Emit message when batch_size is reduced