| From: | Peter Geoghegan <pg(at)bowt(dot)ie> |
|---|---|
| To: | Tomas Vondra <tomas(at)vondra(dot)me> |
| Cc: | Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
| Subject: | Re: index prefetching |
| Date: | 2025-11-10 23:59:07 |
| Message-ID: | CAH2-Wzk9=x=a2TbcqYcX+XXmDHQr5=1v9m4Z_v8a-KwF1Zoz0A@mail.gmail.com |
| Lists: | pgsql-hackers |
On Sun, Nov 2, 2025 at 6:49 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Nothing really new here (I've been
> working on batching on the table AM side, but nothing to show on that
> just yet).
Tomas and I had a meeting on Friday to discuss a way forward with this
project. Progress has stalled, and we feel that now is a good time to
pivot by refactoring the patch into smaller, independently
useful/committable pieces. This email explains our current thinking
(Tomas should correct me if I get anything wrong here).
The short version/executive summary
===================================
The need to get everything done in a single release seems to be
hampering progress. We made quick progress for a few months, but now
that we've exhausted the easy wins, the remaining layering issues are
making every open item nearly intractable.
The layering issues make it very hard to keep on top of all of the
regressions; we're just doing too much at once. We're trying to manage
all of the regressions from the addition of prefetching/a heapam read
stream, while also trying to manage the regressions from moving index
AMs from the old amgettuple interface to the new amgetbatch interface.
And we still need to revise the table AM to move the read stream from
indexam.c over to the table AM side (this isn't in the latest version
of the patch at all).
Just making these AM interface changes is already a huge project on
its own. That makes it hard to focus on just a few things at any one
time; everything is interdependent. We end up playing whack-a-mole
whenever we try to zero in on any single problem, and so we keep going
in circles.
The new tentative plan is to cut scope by focussing on switching over
to the new index AM + table AM interface from the patch in the short
term, for Postgres 19. There is an almost immediate benefit to just
doing that much, unrelated to I/O prefetching for index scans: it
enables batching of heap page buffer locking/unlocking (at the point
where index scans perform heap_hot_search_buffer calls) on the
table AM/heapam side during ordered index scans. That can dramatically
cut down on repeat buffer locking and unlocking, giving us enough of a
win (more details below) to be the sole justification for switching
over to the new set of AM interfaces for Postgres 19.
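To make that concrete, here's a rough sketch of what the heapam side
could do once it can see a whole group of TIDs that point at the same
heap page (structure and variable names are invented for illustration;
only heap_hot_search_buffer and the bufmgr calls are real):

    /*
     * Illustration only: probe every TID in the group under a single
     * buffer lock, instead of locking and unlocking once per TID the
     * way heapam_index_fetch_tuple effectively does today.
     */
    buffer = ReadBuffer(heapRel, blkno);
    LockBuffer(buffer, BUFFER_LOCK_SHARE);

    for (int i = 0; i < ngroup; i++)
    {
        ItemPointerData tid = group_tids[i];
        HeapTupleData   heapTuple;
        bool            all_dead;

        if (heap_hot_search_buffer(&tid, heapRel, buffer, snapshot,
                                   &heapTuple, &all_dead, true))
        {
            /* stash the visible tuple for the i'th output slot */
        }
    }

    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buffer);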
Our long term goals won't change under this phased approach, but our
timeline/short term focus certainly will. We hope to get some general
feedback about this new strategy for the project now, particularly
from Andres. The main practical concern is managing project risk
sensibly.
Difficulties with refactoring AM interfaces while introducing a read stream
===========================================================================
The uncertainty about how to resolve some of the remaining individual
open items for the project (specifically concerns about I/O
prefetching/read stream + resource management concerns, and how they
*interact* with broader layering questions) is the main blocker to
further progress. I'll now give a specific example of what I mean by
this, just because it's likely to be clearer than explaining the
underlying problem in general terms.
Currently, we can have up to 64 leaf-page-wise batches. Usually this
is more than enough, but occasionally we run out of space for batches,
and have to reset the read stream. This is certainly a kludge; we
discard pinned buffers with useful data in order to work around what
we've thought of as an implementation deficiency on the read stream
side up until now. Obviously just discarding useful work like that is
never okay (nobody will argue with me on this, I'm sure).
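For reference, the workaround currently amounts to something like this
(heavily simplified, with approximated names; read_stream_reset is the
real read stream call):

    /*
     * The kludge: when every leaf-page-wise batch slot is in use, throw
     * away the read stream's look-ahead state -- including pinned heap
     * buffers that we were about to consume -- just to free up batches,
     * then start prefetching again from scratch.
     */
    if (scan->nbatches == INDEX_SCAN_MAX_BATCHES)   /* 64 */
    {
        read_stream_reset(scan->rs);    /* drops the pinned-ahead buffers */
        /* ... now free older batches, then resume the stream ... */
    }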
At various points we talked about addressing this particular problem
by teaching the read stream to "pause" such that we can consume those
remaining pinned buffers as needed, without consuming even more heap
pages/buffers to do so (there's no natural upper bound on those, I
think). We'd then "unpause" and resume prefetching again, once we
managed to free some more leaf-page-wise batches up. But I'm now
starting to have serious doubts about this approach (or at least
doubts about the approach that I think other people have in mind when
they talk about this kind of "pausing").
Again, it's really hard to pin down *where* we should be fixing things.
It occurs to me that it doesn't make much sense that the table
AM/indexam.c has *no* awareness of how many heap buffers are already
pinned on its behalf. The fact that that knowledge is *exclusively*
confined to the read stream isn't actually good. What we really need
to do is to care about all buffer pins held by the whole index scan
node, whether for index pages or for heap pages (though note that
holding onto buffer pins on index pages should be rare in practice).
We need to directly acknowledge the tension that exists between heapam
and index AM needs, I think.
The read stream needs to be involved in this process, but it should be
a 2-way conversation. The read stream already defensively checks
externally held buffer pins, which might kinda work for what we have
in mind -- but probably not. It seems bad to depend on what is
supposed to be a defensive measure for all this.
Separately, we'll probably eventually want the heapam side to be able
to notice that a block number that it requests is already in the
pending list, so that it can be marked as a duplicate (and so not
unpinned until the duplicate request is also satisfied/has its heap
tuples returned to the scan). That's another factor pushing things in
this general direction. (Less important, but noted here for
completeness.)
I've been talking about problems when 64 leaf-page-wise batches isn't
enough, which is rare in practice. It's far more common for 64 to be
too *many* batches, which wastes memory (e.g., with largely sequential
heap access we seem to need no more than 5 or 10 at a time, even when
prefetching is really important). But it's hard to see how we could
lazily allocate memory used for batches under anything like the
current structure. It's circular: we should only allocate more
leaf-page-wise batches to make it possible to do more useful heap
prefetching. But right now heap prefetching will stall (or will
"pause" in its own kludgey way) precisely because there aren't enough
leaf-page-wise batches!
Granted, adding a "pausing" capability might be useful elsewhere. But
that in itself doesn't justify the general idea of pausing in the
specific way that index prefetching requires. Why should it?
Why should we pause when we've filled 64 leaf-page-wise batches
instead of 5 or 10 or 1000? ISTM that we're tacitly assuming that the
total number of usable leaf-page-wise batches remaining is a useful
proxy for the costs that actually matter. But why should it be? 64 is
just a number that we picked fairly arbitrarily, and one that has only
a weak relationship with more concrete costs such as leaf page buffer
pins held (as noted already, holding onto a leaf page buffer pin until
we call btfreebatch against its batch isn't actually needed during
most index scans, but there will be exceptions).
My gut instinct is that this stuff will actually matter, in practice,
at least some of the time. And that that'll necessitate giving the
implementation a clear and complete picture of costs and benefits when
scheduling index scans that prefetch. Pausing can't be based on some
randomly chosen magic number, like 64, since that's bound to be
totally wrong in a nonzero number of cases.
ISTM that we cannot subordinate the table AM to the read stream. But
we also can't subordinate the read stream to the table AM. Figuring
all that out is hard. This is the kind of problem that we'd like to
defer for now.
Minor caveat: I'm not sure that Tomas endorses everything I've said
here about "pausing" the read stream. But that probably doesn't matter
much. Either way, these kinds of questions still weigh on the project,
and something should be done about them now, to keep things on course.
Phased approach
===============
As touched upon at the start of this email, under this new phased
approach to the project, the short term goal is to make heapam avoid
repeat buffer locks during index scans where that's clearly avoidable.
Making that much work involves many of the same problems as I/O
prefetching (particularly the basics of layering/AM revisions), but
defers dealing with the thorniest pin resource management issues.
That's what I'll talk about here --- what we can defer, and what we
cannot defer.
But first, on a more positive note, I'll talk about the short term
benefits. My early prototype of the "only lock heap buffer once per
group of TIDs that point to the same heap page returned from an index
scan" optimization has been shown to improve throughput for large-ish
range scans by quite a bit. Variants of pgbench select with queries
like "SELECT * FROM pg_bench_accounts WHERE aid BETWEEN 1000 AND 1500"
show improvements in throughput of up to 20% (and show similar
reductions in query latency). That's a nice win, all on its own.
Now back to talking about risks. There's still a lot of complexity
that cannot be deferred with this phased approach. We must still
switch over index AMs from amgettuple to the new amgetbatch interface.
And we need to make the table AM interface used by index scans higher
level: index_getnext_slot would directly call a new table-AM-wise
callback, passing along its own ScanDirection argument -- we wouldn't
be passing TIDs to the table AM anymore.
The new table AM batch interface would work in terms of "give me the
next tuple in the current scan direction", not in terms of "give me
this random TID, which you know nothing else about". The table AM
becomes directly aware of the fact that it is participating in an
ordered index scan. This design is amenable to allowing the table AM
to see which accesses will be required in the near future -- that
requirement is common to both I/O prefetching and this other heap
buffer lock optimization.
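To make the shape of that interface a little more concrete, the
callback might look something like this (purely illustrative; none of
these names should be taken as final):

    /*
     * Hypothetical table AM callback: return the next tuple from an
     * ordered index scan, in the given scan direction.  The table AM
     * pulls TIDs from the index AM's batches itself; the executor
     * never handles individual TIDs anymore.
     */
    bool        (*index_fetch_next_slot) (IndexFetchTableData *scan,
                                          ScanDirection direction,
                                          TupleTableSlot *slot);

index_getnext_slot would then be little more than a thin wrapper that
forwards its ScanDirection argument to this callback.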
It's even more complicated than just those changes to the index AM and
table AM interfaces: we'll also require that the table AM directly
interfaces with another layer that manages leaf-page-wise batches on
its behalf. They need to *cooperate* with each other, to a certain
degree. The executor proper won't call amgetbatch directly under this
scheme (it'd just provide a library of routines that help table AMs to
do so on their own).
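Roughly speaking, the heapam side would drive that shared batch layer
with helper routines along these lines (again just a sketch; the
helper name is made up):

    /*
     * Sketch of the cooperation: heapam pulls TIDs from the batch
     * layer and groups the ones that share a heap block, so each group
     * can be processed under one buffer lock (and, later on,
     * prefetched).
     */
    while (index_batch_getnext_tid(batchState, direction, &tid))
    {
        if (ItemPointerGetBlockNumber(&tid) != curblock)
        {
            /* process the previous group, then start a new one */
            curblock = ItemPointerGetBlockNumber(&tid);
        }
        /* ... append tid to the current group ... */
    }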
That much doesn't seem deferrable. And it's hard. So this phased
approach certainly doesn't eliminate project risk, by any stretch of
the imagination. Offhand, I'd estimate that taking this phased
approach cuts the number of blockers to making an initial commit in
half.
Here's a nonexhaustive list of notable pain points that *won't* need
to be addressed in the short term, under this new approach/structure
(I'm somewhat repeating myself here):
* Most regressions are likely much easier to avoid/are automatically
avoided, particularly with selective point query scans.
* No need to integrate the read stream, no need to solve most resource
management problems (the prior item about regressions is very much
related to this one).
* No need for streamPos stuff when iterating through TIDs from a
leaf-page-wise batch (only need readPos now). There's no need to keep
those 2 things in sync, because there'll only be 1 thing now.
Here's a nonexhaustive list of problems that we *will* still need to
solve in the earliest committed patch, under this phased approach
(again, I'm repeating myself somewhat):
* Actually integrating the amgetbatch interface in a way that is future-proof.
* Revising the table AM interface such that the table AM is directly
aware of the fact that it is feeding heap/table tuples to an ordered
index scan. That's a big conceptual shift for table AMs.
* Making the prior 2 changes "fit together" sensibly, in a way that
considers current and future needs. Also a big shift.
The "only lock heap buffer once per group of TIDs that point to the
same heap page returned from an index scan" optimization still
requires some general awareness of index AM costs on the table AM
side.
It only makes sense for us to batch up extra TIDs (from the same heap
page) when determining which TIDs are about to be accessed as a group
isn't too expensive -- that is, when the information is readily
available to the table AM because it requested it from the index AM
itself. We're setting a
new precedent by saying that it's okay to share certain knowledge
across what we previously thought of as strictly separate layers of
abstraction. I think that that makes sense (what else could possibly
work?), but I want to draw specific attention to that now.
* We'll still need index-only scans to do things in a way that
prevents inconsistencies/changing our mind in terms of which TIDs are
all-visible.
This has the advantage of allowing us to avoid accessing the
visibility map from the executor proper, which is an existing
modularity violation that we already agree ought to be fixed. This
will also keep us honest (we won't be deferring more than we should).
But that's not why I think it's essential to move VM accesses into the
table AM.
We should only batch together accesses to a heap page when we know for
sure that those TIDs will in fact be accessed. How are we supposed to
have general and robust handling for all that, in a world where the
visibility map continues to be accessed from the executor proper? At
best, not handling VM integration comprehensively (for index-only
scans) ties our hands around reordering work, and seems like it'd be
very brittle. It would likely run into the same problems we currently
have with managing a read stream in indexam.c while relying on tacit
knowledge of precisely how those same heap blocks will later actually
be accessed from the heapam side.
The sensible solution is to put control of the scan's progress all in
one place. We don't want to have to worry about what happens when the
VM is concurrently set or unset.
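As a concrete illustration of that last point, for index-only scans
the table AM itself would consult the visibility map at the moment it
decides how to handle a group of TIDs, rather than having the executor
do it tuple-by-tuple (sketch only; VM_ALL_VISIBLE is real, everything
else is invented):

    /*
     * With VM access under the table AM's control, the "skip the heap
     * for this whole group" decision is made exactly once, right
     * before the group would otherwise be fetched -- so we can't
     * change our mind about which TIDs are all-visible halfway
     * through.
     */
    if (indexonly && VM_ALL_VISIBLE(heapRel, curblock, &vmbuffer))
    {
        /* return index tuples for the whole group; no heap access */
    }
    else
    {
        /* fetch the whole group from the heap under one buffer lock */
    }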
When Andres and Tomas talk about table AM modularity stuff, they tend
to focus on why it's bad that the table AM interface uses heap TIDs
specifically. I agree with all that. But even if I didn't, everything
that I just said about the need to centralize control in the table AM
would still be true. That's why I'm focussing on that here (it's
really pretty subtle).
That's all I have for now. My thoughts here should be considered
tentative; I want to put my thinking on a more rigorous footing before
really committing to this new phased approach.
--
Peter Geoghegan