Re: index prefetching

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-11-12 17:39:05
Message-ID: 07783ccc-0c85-4c0c-817f-faa5a2c38e64@vondra.me
Lists: pgsql-hackers

On 11/11/25 00:59, Peter Geoghegan wrote:
> On Sun, Nov 2, 2025 at 6:49 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> Nothing really new here (I've been
>> working on batching on the table AM side, but nothing to show on that
>> just yet).
>
> Tomas and I had a meeting on Friday to discuss a way forward with this
> project. Progress has stalled, and we feel that now is a good time to
> pivot by refactoring the patch into smaller, independently
> useful/committable pieces. This email explains our current thinking
> (Tomas should correct me if I get anything wrong here).
>
> The short version/executive summary
> ===================================
>
> The need to get everything done in one single release seems to be
> hampering progress. We made quick progress for a few months, but now
> that we've exhausted the easy wins, the layering issues that remain
> are making every remaining open item near intractable.
>
> The layering issues make it very hard to keep on top of all of the
> regressions; we're just doing too much at once. We're trying to manage
> all of the regressions from the addition of prefetching/a heapam read
> stream, while also trying to manage the regressions from moving index
> AMs from the old amgettuple interface to the new amgetbatch interface.
> And we still need to revise the table AM to move the read stream from
> indexam.c over to the table AM side (this isn't in the latest version
> of the patch at all).
>
> Just making these AM interface changes is already a huge project on
> its own. This makes it hard to focus on just a few things at any one
> time; everything is interdependent. We seem to end up playing
> whack-a-mole whenever we try to zero in on any single problem; we end
> up going in circles.
>
> The new tentative plan is to cut scope by focussing on switching over
> to the new index AM + table AM interface from the patch in the short
> term, for Postgres 19. There is an almost immediate benefit to just
> doing that much, unrelated to I/O prefetching for index scans: it
> enables batching of heap page buffer locking/unlocking (during the
> point where index scans perform heap_hot_search_buffer calls) on the
> table AM/heapam side during ordered index scans. That can dramatically
> cut down on repeat buffer locking and unlocking, giving us enough of a
> win (more details below) to be the sole justification for switching
> over to the new set of AM interfaces for Postgres 19.
>
> Our long term goals won't change under this phased approach, but our
> timeline/short term focus certainly will. We hope to get some general
> feedback about this new strategy for the project now, particularly
> from Andres. The main practical concern is managing project risk
> sensibly.
>
> Difficulties with refactoring AM interfaces while introducing a read stream
> ===========================================================================
>
> The uncertainty about how to resolve some of the remaining individual
> open items for the project (specifically concerns about I/O
> prefetching/read stream + resource management concerns, and how they
> *interact* with broader layering questions) is the main blocker to
> further progress. I'll now give a specific example of what I mean by
> this, just because it's likely to be clearer than explaining the
> underlying problem in general terms.
>
> Currently, we can have up to 64 leaf-page-wise batches. Usually this
> is more than enough, but occasionally we run out of space for batches,
> and have to reset the read stream. This is certainly a kludge; we
> discard pinned buffers with useful data in order to work around what
> we've thought of as an implementation deficiency on the read stream
> side up until now. Obviously just discarding useful work like that is
> never okay (nobody will argue with me on this, I'm sure).
>
> At various points we talked about addressing this particular problem
> by teaching the read stream to "pause" such that we can consume those
> remaining pinned buffers as needed, without consuming even more heap
> pages/buffers to do so (there's no natural upper bound on those, I
> think). We'd then "unpause" and resume prefetching again, once we
> managed to free some more leaf-page-wise batches up. But I'm now
> starting to have serious doubts about this approach (or at least
> doubts about the approach that I think other people have in mind when
> they talk about this kind of "pausing").
>
> Again, it's really hard to pin down *where* we should be fixing things.
>
> It occurs to me that it doesn't make much sense that the table
> AM/indexam.c has *no* awareness of how many heap buffers are already
> pinned on its behalf. The fact that that knowledge is *exclusively*
> confined to the read stream isn't actually good. What we really need
> to do is to care about all buffer pins held by the whole index scan
> node, whether for index pages or for heap pages (though note that
> holding onto buffer pins on index pages should be rare in practice).
> We need to directly acknowledge the tension that exists between heapam
> and index AM needs, I think.
>
> The read stream needs to be involved in this process, but it should be
> a 2-way conversation. The read stream already defensively checks
> externally held buffer pins, which might kinda work for what we have
> in mind -- but probably not. It seems bad to depend on what is
> supposed to be a defensive measure for all this.
>
> Separately, we'll probably eventually want the heapam side to be able
> to notice that a block number that it requests is already in the
> pending list, so that it can be marked as a duplicate (and so not
> unpinned until the duplicate request is also satisfied/has its heap
> tuples returned to the scan). That's another factor pushing things in
> this general direction. (Less important, but noted here for
> completeness.)
>
> I've been talking about problems when 64 leaf-page-wise batches isn't
> enough, which is rare in practice. It's far more common for 64 to be
> too *many* batches, which wastes memory (e.g., with largely sequential
> heap access we seem to need no more than 5 or 10 at a time, even when
> prefetching is really important). But it's hard to see how we could
> lazily allocate memory used for batches under anything like the
> current structure. It's circular: we should only allocate more
> leaf-page-wise batches to make it possible to do more useful heap
> prefetching. But right now heap prefetching will stall (or will
> "pause" in its own kludgey way) precisely because there aren't enough
> leaf-page-wise batches!
>
> Granted, adding a "pausing" capability might be useful elsewhere. But
> that in itself doesn't justify the general idea of pausing in the
> specific way that index prefetching requires. Why should it?
>
> Why should we pause when we've filled 64 leaf-page-wise batches
> instead of 5 or 10 or 1000? ISTM that we're tacitly assuming that the
> total number of usable leaf-page-wise batches remaining is a useful
> proxy for the costs that actually matter. But why should it be? 64 is
> just a number that we picked fairly arbitrarily, and one that has only
> a weak relationship with more concrete costs such as leaf page buffer
> pins held (as noted already, needing to hold onto a leaf page buffer
> pin until we call btfreebatch against its batch isn't actually needed
> during most index scans, but there will be exceptions).
>
> My gut instinct is that this stuff will actually matter, in practice,
> at least some of the time. And that that'll necessitate giving the
> implementation a clear and complete picture of costs and benefits when
> scheduling index scans that prefetch. Pausing can't be based on some
> randomly chosen magic number, like 64, since that's bound to be
> totally wrong in a nonzero number of cases.
>
> ISTM that we cannot subordinate the table AM to the read stream. But
> we also can't subordinate the read stream to the table AM. Figuring
> all that out is hard. This is the kind of problem that we'd like to
> defer for now.
>
> Minor caveat: I'm not sure that Tomas endorses everything I've said
> here about "pausing" the read stream. But that probably doesn't matter
> much. Either way, these kinds of questions still weigh on the project,
> and something should be done about it now, to keep things on course.
>

I think I generally agree with what you said here about the challenges,
although it's a bit too abstract to respond to individual parts. I just
don't know how to rework the design to resolve this ...

As for the read stream "pausing", I think it's pretty clear it's more a
workaround than desired behavior. We only pause the stream because we
need to limit the look-ahead distance (measured in index leaf pages),
and the read_stream has no such concept. It only knows about heap pins,
but an index-only scan may need to read many leaf pages to find a single
heap page to prefetch, and those leaf pages are invisible to the stream.
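
To make the workaround concrete, it boils down to something like this
(a simplified sketch with made-up names such as INDEX_SCAN_MAX_BATCHES
and scan->heapStream, not the actual patch code):

    if (scan->nbatches == INDEX_SCAN_MAX_BATCHES)    /* currently 64 */
    {
        /*
         * Out of batch slots: give up on the look-ahead. Resetting the
         * stream releases the heap buffers it has already pinned, so
         * useful work is thrown away; the stream is restarted later,
         * once some batches have been freed.
         */
        read_stream_reset(scan->heapStream);
    }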

The limit of 64 batches is entirely arbitrary. I needed a number that
would limit the amount of memory and time wasted on useless look-ahead,
and 64 seemed "reasonable" (not too high, yet high enough not to be hit
very often). Originally there was a fixed-length queue of batches, with
64 as its capacity, but we no longer do it that way. So it's really just
an imperfect safety measure against "runaway" streams.

I don't want to get into too much detail about this particular issue;
it's already discussed elsewhere in this thread. But if there were a way
to "tell" the read stream how much effort to spend looking ahead, we
wouldn't need the pausing (at least not in the end+reset way).
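
Something along these lines would probably be enough (a purely
hypothetical API, it does not exist today):

    /*
     * Hypothetical: let the caller cap how much effort the stream may
     * spend looking ahead, measured in index leaf pages consumed by
     * the block-number callback rather than in heap buffers pinned.
     */
    read_stream_set_lookahead_limit(stream, max_leaf_pages);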

> Phased approach
> ===============
>
> As touched upon at the start of this email, under this new phased
> approach to the project, the short term goal is to make heapam avoid
> repeat buffer locks during index scans where that's clearly avoidable.
> Making that much work shares many of the same problems with I/O
> prefetching (particularly the basics of layering/AM revisions), but
> defers dealing with the thorniest issues with pin resource management.
> That's what I'll talk about here --- what we can defer, and what we
> cannot defer.
>
> But first, on a more positive note, I'll talk about the short term
> benefits. My early prototype of the "only lock heap buffer once per
> group of TIDs that point to the same heap page returned from an index
> scan" optimization has been shown to improve throughput for large-ish
> range scans by quite a bit. Variants of pgbench select with queries
> like "SELECT * FROM pg_bench_accounts WHERE aid BETWEEN 1000 AND 1500"
> show improvements in throughput of up to 20% (and show similar
> reductions in query latency). That's a nice win, all on its own.
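
For readers following along, the optimization is roughly this (a
simplified sketch with made-up variables, not the actual prototype):

    /* all TIDs in tids[] point into the same heap block */
    Buffer      buffer = ReadBuffer(heapRel, block);

    LockBuffer(buffer, BUFFER_LOCK_SHARE);
    for (int i = 0; i < ntids; i++)
    {
        bool        all_dead;

        /* one lock/unlock cycle for the whole group, not per TID */
        found[i] = heap_hot_search_buffer(&tids[i], heapRel, buffer,
                                          snapshot, &tuples[i],
                                          &all_dead, true);
    }
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buffer);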
>
> Now back to talking about risks. There's still a lot of complexity
> that cannot be deferred with this phased approach. We must still
> switch over index AMs from amgettuple to the new amgetbatch interface.
> And, we need to make the table AM interface used by index scans higher
> level: index_getnext_slot would directly call a new table-AM-wise
> callback, just passing it its own ScanDirection argument directly --
> we wouldn't be passing TIDs to the table AM anymore.
>
> The new table AM batch interface would work in terms of "give me the
> next tuple in the current scan direction", not in terms of "give me
> this random TID, which you know nothing else about". The table AM
> becomes directly aware of the fact that it is participating in an
> ordered index scan. This design is amenable to allowing the table AM
> to see which accesses will be required in the near future -- that
> requirement is common to both I/O prefetching and this other heap
> buffer lock optimization.
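
In other words, something with roughly this shape (a hypothetical
callback signature, just to illustrate the direction; not necessarily
what the patch ends up with):

    /*
     * Hypothetical table AM callback: return the next tuple in the
     * scan's current direction, rather than fetching a caller-supplied
     * TID. The table AM pulls TIDs from the index AM's batches itself.
     */
    bool        (*index_scan_getnext_slot) (IndexFetchTableData *scan,
                                            ScanDirection direction,
                                            TupleTableSlot *slot);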
>
> It's even more complicated than just those changes to the index AM and
> table AM interfaces: we'll also require that the table AM directly
> interfaces with another layer that manages leaf-page-wise batches on
> its behalf. They need to *cooperate* with each other, to a certain
> degree. The executor proper won't call amgetbatch directly under this
> scheme (it'd just provide a library of routines that help table AMs to
> do so on their own).
>
> That much doesn't seem deferrable. And it's hard. So this phased
> approach certainly doesn't eliminate project risk, by any stretch of
> the imagination. Offhand, I'd estimate that taking this phased
> approach cuts the number of blockers to making an initial commit in
> half.
>
> Here's a nonexhaustive list of notable pain points that *won't* need
> to be addressed in the short term, under this new approach/structure
> (I'm somewhat repeating myself here):
>
> * Most regressions are likely much easier to avoid/are automatically
> avoided. Particularly with selective point query scans.
>
> * No need to integrate the read stream, no need to solve most resource
> management problems (the prior item about regressions is very much
> related to this one).
>
> * No need for streamPos stuff when iterating through TIDs from a
> leaf-page-wise batch (only need readPos now). There's no need to keep
> those 2 things in sync, because there'll only be 1 thing now.
>
> Here's a nonexhaustive list of problems that we *will* still need to
> solve in the earliest committed patch, under this phased approach
> (again, I'm repeating myself somewhat):
>
> * Actually integrating the amgetbatch interface in a way that is future-proof.
>
> * Revising the table AM interface such that the table AM is directly
> aware of the fact that it is feeding heap/table tuples to an ordered
> index scan. That's a big conceptual shift for table AMs.
>
> * Making the prior 2 changes "fit together" sensibly, in a way that
> considers current and future needs. Also a big shift.
>
> The "only lock heap buffer once per group of TIDs that point to the
> same heap page returned from an index scan" optimization still
> requires some general awareness of index AM costs on the table AM
> side.
>
> It only makes sense for us to batch-up extra TIDs (from the same heap
> page) when determining which TIDs are about to be accessed as a group
> isn't too expensive/the information is readily available to the table
> AM, because it requested it from the index AM itself. We're setting a
> new precedent by saying that it's okay to share certain knowledge
> across what we previously thought of as strictly separate layers of
> abstraction. I think that that makes sense (what else could possibly
> work?), but I want to draw specific attention to that now.
>
> * We'll still need index-only scans to do things in a way that
> prevents inconsistencies/changing our mind in terms of which TIDs are
> all-visible.
>
> This has the advantage of allowing us to avoid accessing the
> visibility map from the executor proper, which is an existing
> modularity violation that we already agree ought to be fixed. This
> will also keep us honest (we won't be deferring more than we should).
> But that's not why I think it's essential to move VM accesses into the
> table AM.
>
> We should only batch together accesses to a heap page when we know for
> sure that those TIDs will in fact be accessed. How are we supposed to
> have general and robust handling for all that, in a world where the
> visibility map continues to be accessed from the executor proper? At
> best, not handling VM integration comprehensively (for index-only
> scans) ties our hands around reordering work, and seems like it'd be
> very brittle. It would likely have similar problems to our current
> problems with managing a read stream in indexam.c, while relying on
> tacit knowledge of how precisely those same heap blocks will later
> actually be accessed from the heapam side.
>
> The sensible solution is to put control of the scan's progress all in
> one place. We don't want to have to worry about what happens when the
> VM is concurrently set or unset.
>
> When Andres and Tomas talk about table AM modularity stuff, they tend
> to focus on why it's bad that the table AM interface uses heap TIDs
> specifically. I agree with all that. But even if I didn't, everything
> that I just said about the need to centralize control in the table AM
> would still be true. That's why I'm focussing on that here (it's
> really pretty subtle).
>
> That's all I have for now. My thoughts here should be considered
> tentative; I want to put my thinking on a more rigorous footing before
> really committing to this new phased approach.
>

I don't object to the "phased approach" of doing the batching first,
but without seeing the code I can't really say if (or how much) it helps
with resolving the design/layering questions. It feels a bit too
abstract to me. While working on the prefetching I moved the code
between layers about three times, and I'm still not quite sure which
layer should be responsible for which piece :-(

regards

--
Tomas Vondra
