Quick Links

Re: index prefetching

From:	Peter Geoghegan <pg(at)bowt(dot)ie>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Tomas Vondra <tomas(at)vondra(dot)me>, Alexandre Felipe <o(dot)alexandre(dot)felipe(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject:	Re: index prefetching
Date:	2026-03-10 20:57:35
Message-ID:	CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Fri, Feb 27, 2026 at 6:52 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> This is a huge change. Is there a chance we can break it up into more
> manageable chunks?

Attached is v12, which has revisions that address most of your
feedback items. It also includes items that address problems that I
noticed during performance validation work.

Highlights:

* Substantial revisions that give table AMs and index AMs direct
control over batch layout -- without giving up on batch
recycling/caching. This is essentially what you (Andres) requested
because the design from v11 was not sufficiently AM agnostic. In
particular:

- Table AMs now control the size and layout of visibility information
(in practice heapam uses this to store per-item visibility state from
the visibility map).

- Index AMs have their own opaque state for things like sibling link
block numbers, avoiding the assumption that other index AMs supporting
amgetbatch will need to work like nbtree and hash as regards how they
navigate to the next index page/index keyspace associated with each
batch.

* No more read stream yielding. Numerous new patches from Andres are
now included, which helps with this. In particular, "WIP: read_stream:
Only increase distance when waiting for IO" fixes the problematic
regression in an adversarial query -- the one that prompted me to
invent yielding in the first place. As a result of all this, the read
stream callback added by the prefetching commit itself is now
substantially simpler than it was in v11.

* There are now a couple of extra patches created by breaking things
into more distinct commits. Namely, there's a new "heapam: Track heap
block in IndexFetchHeapData using xs_blk" commit, as well as a new
"Make IndexScanInstrumentation a pointer in executor scan nodes"
commit.

* Moreover, some commits now appear in a slightly different order,
prioritizing work closer to being committable; those commits now come
first.

* The SGML doc changes are now broken up a bit better: the commit that
adds basic amgetbatch functionality now includes SGML doc updates
describing the required API revisions. The SGML doc changes that
describe prefetching concepts now appear in the later prefetching
commit.

* New commit "Use simple hash for PrivateRefCount" addresses some of
the problems we were seeing with PrivateRefCount performance. This
generic optimization addresses an existing problem that would
otherwise be much worse with the index prefetching work in place.

* The new xs_read_extremal_only mechanism (which replaces selfuncs.c's
current use of VISITED_PAGES_LIMIT) will now hold on for longer -- up
to 3 index leaf pages can be read before we give up on the scan. The
number of pages that we're willing to read before giving up is now
under the control of the selfuncs.c caller. This addresses concerns
that Tom had about selfuncs.c giving up too easily, for example when
the rightmost/leftmost leaf page happens to have very few items.

* v12 fixes a ~5% regression in bitmap index scans, which was tied to
the fact that we were naively using retail pallocs to allocate new
batches in all cases. The fix gives amgetbitmap routines the ability
to use their own batch cache mechanism, similar to the one used during
amgetbatch scans. In practice, btgetbitmap and hashgetbitmap scans
will never need to do more than one palloc to allocate a batch; they
can continually reuse the same batch. This relies on the fact that
amgetbitmap routines don't actually need to hold open more than a
single batch at a time/maintain a batch ring buffer (just like master
manages currPos state).

* v12 addresses certain performance issues with nestloop joins and
merge joins by no longer dropping the scan's heap page buffer pin on
rescan. This keeps regressions with fully cached workloads to an
absolute minimum, and seems to have no real downside.

However, I have NOT yet acted on a few feedback items from Andres:

* I still don't know what Andres meant about requiring table AMs to
free batch index page buffer pins representing a modularity violation.
I don't see how we can reasonably avoid it while still preserving the
guarantees needed to safely drop buffer pins eagerly during index-only
scans that require prefetching.

* I'm also not at all sure what Andres meant about index AMs like hash
not holding onto their own buffer pins, given that prefetching uses a
read stream sensitive to the number of buffer pins the backend holds.

I understand the general concern about interfering with the read
stream, I think, but I don't see how imposing such a strict rule on
index AMs, which have always been allowed to do this, can be practical
(though note that nbtree doesn't actually hold on to any buffer pins
for internal purposes, and hasn't since Postgres 18 commit 1bd4bc85,
so at least nbtree can be relied on to never do this). I asked Andres
about these 2 points not too long ago [1], and await further feedback
on those questions.

* Separate from the 2 closely related items about batch buffer pins, I
have not yet made most of the changes Andres requested to the "Use
ExecSetTupleBound hint during index scans" commit -- just because I
haven't had time. v11 of the patch set hasn't applied cleanly in about
a week, so I didn't want to further delay posting v12.

Thanks

[1] https://postgr.es/m/CAH2-Wznxu+AFz-EBOG-XiRA_R3nXLp45NEiGSD3ebx3h=OKPAw@mail.gmail.com

--
Peter Geoghegan

Attachment	Content-Type	Size
v12-0021-Conditional-locking-in-pgaio_worker_submit_inter.patch	application/octet-stream	3.4 KB
v12-0022-Add-fake-LSN-support-to-hash-index-AM.patch	application/octet-stream	13.8 KB
v12-0001-Use-simple-hash-for-PrivateRefCount.patch	application/octet-stream	5.8 KB
v12-0020-WIP-aio-io_uring-Use-IO-size-not-IO-queue-to-tri.patch	application/octet-stream	4.1 KB
v12-0023-Make-hash-index-AM-use-amgetbatch-interface.patch	application/octet-stream	41.4 KB
v12-0019-WIP-read_stream-Prevent-distance-from-decaying-t.patch	application/octet-stream	2.9 KB
v12-0017-bufmgr-Fix-use-of-wrong-variable-in-GetPrivateRe.patch	application/octet-stream	1.1 KB
v12-0018-WIP-read_stream-Only-increase-distance-when-wait.patch	application/octet-stream	2.4 KB
v12-0015-WIP-aio-io_uring-Allow-IO-methods-to-check-if-IO.patch	application/octet-stream	4.6 KB
v12-0016-bufmgr-Return-whether-WaitReadBuffers-needed-to-.patch	application/octet-stream	2.8 KB
v12-0014-WIP-read_stream-Issue-IO-synchronously-while-in-.patch	application/octet-stream	2.0 KB
v12-0012-Make-buffer-hit-helper.patch	application/octet-stream	6.0 KB
v12-0011-Use-ExecSetTupleBound-hint-during-index-scans.patch	application/octet-stream	10.9 KB
v12-0010-Add-heapam-index-scan-I-O-prefetching.patch	application/octet-stream	44.1 KB
v12-0013-Don-t-wait-for-already-in-progress-IO.patch	application/octet-stream	20.6 KB
v12-0009-Limit-get_actual_variable_range-to-scan-three-in.patch	application/octet-stream	9.3 KB
v12-0007-heapam-Track-heap-block-in-IndexFetchHeapData-us.patch	application/octet-stream	3.7 KB
v12-0005-Use-fake-LSNs-to-improve-nbtree-dropPin-behavior.patch	application/octet-stream	15.1 KB
v12-0006-Make-IndexScanInstrumentation-a-pointer-in-execu.patch	application/octet-stream	9.1 KB
v12-0008-Add-interfaces-that-enable-index-prefetching.patch	application/octet-stream	246.2 KB
v12-0004-Use-GetXLogInsertEndRecPtr-in-XLogGetFakeLSN.patch	application/octet-stream	3.1 KB
v12-0002-Don-t-allocate-_bt_search-stack.patch	application/octet-stream	8.7 KB
v12-0003-Extract-fake-LSN-infrastructure-from-GiST-index-.patch	application/octet-stream	16.6 KB

In response to

Re: index prefetching at 2026-02-27 23:52:09 from Andres Freund

Responses

Re: index prefetching at 2026-03-10 22:47:52 from Andres Freund
Re: index prefetching at 2026-03-12 14:43:28 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Peter Geoghegan	2026-03-10 21:00:48	Re: Problems with get_actual_variable_range's VISITED_PAGES_LIMIT
Previous Message	Masahiko Sawada	2026-03-10 20:55:44	Re: Use allocation macros in the logical replication code