Re: index prefetching

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Alexandre Felipe <o(dot)alexandre(dot)felipe(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2026-06-14 18:54:26
Message-ID: CAH2-Wzn+C=mAMv9aW3Skfh80JPpHKT3yM=DYwkrrhYyG2f+_vg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jun 6, 2026 at 8:52 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v26, which resolves many outstanding issues from when we
> paused work on this, at the end of the final CF for Postgres 19.

Attached is v27, which cleans up some remaining loose ends.

* The standout item here is that v27 simplifies
heapam_index_getnext_scanbatch_pos by making it wholly rely on new,
higher level helper functions in indexbatch.h.

Similar simplifications exist for heapam_index_prefetch_next_block,
our read stream callback.

Andres requested these changes on April 5 due to concerns about code
duplication across table AMs. That's one reason to do things this way,
but there's another: heapam is no longer "partially responsible" for
managing the scan's batch ring buffer. For example,
heapam_index_getnext_scanbatch_pos no longer has to invalidate
prefetchPos itself so that heapam_index_prefetch_next_block cannot
become confused by prefetchPos.batch getting so far behind
scanPos.batch, making it appear ahead instead of behind (due to uint8
wraparound). In general, heapam knows nothing about these batchringbuf
implementation details as of v27.

Removing an obsolescent scanBatch from the ringbuffer also lives in
one of the new helpers -- that's obviously one aspect of batchringbuf
management that heapam shouldn't have to deal with. Especially because
some of the indexbatch.c helpers assume that the table AM follows
certain conventions. In particular, the routines that deal with
mark/restore assume that scanBatch must be the head slot/entry of
batchringbuf when called; the new helper functions enforce that
scheme.

Note that batchringbuf management works essentially the same way as in
previous versions; v27 improves things by fully taking it out of
heapam's hands.
Importantly, the concept of the table AM applying its own handling
when scanPos falls behind prefetchPos is now encapsulated in these new
helper functions. This doesn't assume the use of a read stream, nor
does it place other inappropriate restrictions on table AMs.

The heapam implementation side keeps its handling for obviously
heapam-related tasks: heapam_index_getnext_scanbatch_pos must still
return a TID using heapam-specific logic (which considers the
visibility map cache and so on), and must still unpause a paused read
stream (heapam owns its read stream, and it's possible that other
table AMs won't even use a read stream but will require a similar
pausing mechanism). Obviously, heapam_index_prefetch_next_block still
pauses the read stream when it detects that we've run out of
batchringbuf slots. (Basically, the concept of pausing a read stream
is decoupled from the concept of running out of batchringbuf slots.)

Version 27 also includes a tangentially related style improvement:

* v27 moves most of the new batch-related helpers that previously
appeared in relscan.h into indexbatch.h.

We still declare structs like IndexScanBatchData and BatchRingBuffer
in relscan.h (which is needed to embed batchringbuf in IndexScanDesc),
but all batch accessor functions and batchringbuf
manipulation/maintenance functions now live in indexbatch.h. This was
a pretty natural consequence of inventing the aforementioned new
higher-level helper functions that heapam now uses.

Some of the "Elementary batch ring buffer operation" functions (e.g.,
index_scan_batch_append, index_scan_pos_startbatch,
index_scan_pos_advance, index_scan_batch_full) are no longer called
directly by heapam; they are now only called through the new higher
level helper functions. Others (e.g., index_scan_pos_cmp) are still
called from within heapam, but only within documenting assertions.

> We do keep the underlying heapam struct (IndexFetchHeapData), which is
> now accessed through a void pointer in IndexScanDesc -- exactly like
> it's done with index AMs. There's no more "inheritance from
> IndexFetchTableData", which seems to add nothing anymore.

* This new v27 renames IndexFetchHeapData to IndexScanHeapData.

This makes sense because the struct now represents "the heapam private
state that relates to an index scan", not "the heapam state that
relates to the part of the index scan that just deals with fetching
heap tuples for a given TID".

I also added new and better test infrastructure:

* Valgrind marks recycled batch memory as UNDEFINED, which should
catch cases where AMs have dangling references to recycled batch
memory that was never actually pfree'd.

* Added test_indexscan, a new test module that performs index scans
with a variety of snapshot types (plus accompanying pg_regress tests).

Without this, no test will fail if each call to
table_index_getnext_slot (actually, the underlying slot-based callback
that heapam implements) starts with scan->xs_heap_continue set to
false unconditionally. IOW, we actually have some chance of noticing
if there's a bug that makes heapam fail to correctly maintain
scan->xs_heap_continue. That's only really possible during non-MVCC
scans, and is very difficult to test without dedicated infrastructure
such as this.

There's also a new isolation test (dirty_snapshot.spec) which
illustrates the role of dirty snapshots in constraint enforcement. For
illustrative purposes, it compares what a dirty snapshot sees to what
the xact's active MVCC snapshot sees at the same point in time across
various isolation test schedules (e.g., xact commit vs xact abort, HOT
update vs non-HOT update). I'm not sure if this extra isolation test
is actually worth the added test cycles, but thought I'd include it
for now. Maybe there's a historical bug that we could add test
coverage for using dirty_snapshot.spec?

* Related to the last item, v27 formalizes the rules around what is
allowed during non-MVCC scans.

I think it's worth grounding the rules in this area rigorously,
supported by defensive documentation assertions. It probably won't end
up mattering to anybody, but I see no reason not to take a position on
this stuff -- it doesn't cost us much.

An assertion (in the prefetching patch) now verifies that non-MVCC
scans never change scan direction. These scans are still allowed to
scan either forwards or backwards (because selfuncs.c expects to be
able to scan either way), but they are not allowed to change the
direction of a scan that's already underway.

It's fairly obvious that this cannot be allowed; what if the scan
changes direction while we're traversing a HOT chain? There's no
provision for reversing scan direction when stopped in the middle of a
HOT chain. Whereas it should be okay to "scan a HOT chain in the
forwards direction" during a backwards scan -- that still ought to
produce reasonable answers.

> Things I haven't done
> ---------------------
>
> * There have been recent CI failures on "linux-autoconf", that I
> haven't debugged.
>
> 001_aio.pl seems to reliably report "Dubious, test returned 4 (wstat
> 1024, 0x400)" on this CI target. I fully expect this v26 to fail when
> CFTester runs it through CI.

No progress here.

> * I've not given much thought to improving the cost model of index
> scans, to account for prefetching.

Today several of us (Robert Haas, Tomas, Andres, and I) discussed this
privately over IM. I hope that we'll have made concrete progress on it
before long, because it's really important.

I even think the main benefit of this work might be enabling the
planner to use plain scans (and index-only scans) more often in
general; this will be very beneficial even with workloads whose
working set fits in shared_buffers. Prefetching makes it safe to be
optimistic about the added overhead from random I/O, particularly
relative to equivalent bitmap scans, so no actual prefetching is
required for users to see significant benefits.

> * Continued to care about pg_attribute_always_inline making certain
> function prototypes overlong and ugly.
>
> Andres has a patch to shorten pg_attribute_always_inline to
> pg_always_inline in the open CF, which will presumably be committed
> soon. I don't want to forego prototypes; I am particular about the
> order in which function definitions appear. For now, ugly and overlong
> prototypes seem acceptable (I won't commit any patches with that
> issue).

Still waiting for Andres' pg_always_inline patch to get committed.

--
Peter Geoghegan

Attachment Content-Type Size
v27-0007-Allow-read_stream_reset-to-not-wait-for-IO-compl.patch application/octet-stream 20.5 KB
v27-0009-WIP-aio-bufmgr-Fix-race-condition-leading-to-dea.patch application/octet-stream 3.1 KB
v27-0001-Add-slot-based-table-AM-index-scan-interface.patch application/octet-stream 128.8 KB
v27-0008-aio-Fix-pgaio_io_wait-for-staged-IOs-B.patch application/octet-stream 6.3 KB
v27-0006-heapam-Add-index-scan-I-O-prefetching.patch application/octet-stream 54.7 KB
v27-0003-Adopt-amgetbatch-interface-in-hash-index-AM.patch application/octet-stream 47.1 KB
v27-0004-WIP-Adopt-amgetbatch-interface-in-GiST-index-AM.patch application/octet-stream 102.5 KB
v27-0005-heapam-Optimize-pin-transfers-during-index-scans.patch application/octet-stream 6.6 KB
v27-0002-Add-amgetbatch-interface-and-adopt-it-in-nbtree.patch application/octet-stream 261.3 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Previous Message Andrey Borodin 2026-06-14 18:48:48 Re: Add wait events for server logging destination writes