Re: index prefetching

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-12-01 16:32:30
Message-ID: 138c329d-e8b0-438c-91e8-30ed6e47f0ac@vondra.me
Lists: pgsql-hackers

On 12/1/25 02:23, Peter Geoghegan wrote:
> On Mon, Nov 10, 2025 at 6:59 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> The new tentative plan is to cut scope by focussing on switching over
>> to the new index AM + table AM interface from the patch in the short
>> term, for Postgres 19.
>
> Attached patch makes the table AM revisions we talked about. This is a
> significant change in direction, so I'm adopting a new patch
> versioning scheme: this new version is v1. (I just find it easier to
> deal with sequential patch version numbers.)
>

Thanks for the new version! I like the layering in this patch, moving
some of the logic from indexam.c/the executor into the table AM. It
makes some of the code much cleaner, I think.

> I'm sure that I'll have made numerous mistakes in this new v1. There
> will certainly be some bugs, and some of the exact details of how I'm
> doing the layering are likely suboptimal or even wrong. I am
> nevertheless cautiously optimistic that this will be the last major
> redesign that will be required for this project.
>

Sounds good. FWIW I don't see any major issues in this version.

>> There is an almost immediate benefit to just
>> doing that much, unrelated to I/O prefetching for index scans: it
>> enables batching of heap page buffer locking/unlocking (during the
>> point where index scans perform heap_hot_search_buffer calls) on the
>> table AM/heapam side during ordered index scans.
>
> Note that this new v1 doesn't yet include the important heapam buffer
> locking optimization discussed here. It didn't seem worth holding up
> everything just for that. Plan is to get to it next.
>
> (It isn't intrinsically all that complicated to add the optimization
> with this new table AM orientated structure, but doing so would have
> made performance validation work/avoiding regressions with simple
> queries that much harder. So I just put it off for a bit longer.)
>

Understood. I presume that optimization fits mostly "seamlessly" into
this patch design.

> What's new in v1 (compared to v20251109-*, the prior version):
>
> * The first patch in the series is now mostly about changing the table
> AM and index AM in a complementary way (not just about adding the
> amgetbatch interface to the index AM).
>
> To summarize this point (mostly just a recap of recent discussion on
> the table AM API on this thread) with its own sub points:
> > - We're now using a slot-based table AM interface that understands
> scan direction. We now do all VM access for index-only scans on the
> heapam side, fixing that existing table AM modularity violation once
> and for all.
>

I admit I was a bit skeptical about this approach, mostly because I
didn't have a clear idea how it would work. But it turns out to be quite
clean. Well, definitely cleaner than what I had before.

> - Batches returned by amgetbatch are directly managed by heapam,
> giving it the direct control that it requires to get the best possible
> performance. Whether that's for adding I/O prefetching, or for other
> optimizations.
>
> - The old table_index_fetch_tuple index scan interface is still needed
> -- though only barely.
>
> The rule going forward for core executor code is that it should always
> use this new slot-based interface, unless there is a specific need for
> such a caller to pass *their own* TID, in a way that cannot possibly
> be delegated to our new high level table AM interface.
>
> For example, we still need table_index_fetch_tuple for nbtree's
> _bt_check_unique; it must pass TIDs to heapam, and get back tuples,
> without starting any new index scan to do so (the only "index scan"
> involved in the case of the _bt_check_unique caller takes place in the
> btinsert that needs to perform unique index enforcement in passing). I
> think it makes perfect sense that a small handful of special case
> callers still need to use table_index_fetch_tuple, since there really
> is no way around the need for these callers to pass their own TID.
>

Agreed. I don't think it makes sense to require eliminating all these
table_index_fetch_tuple calls (even if it were possible).

> * Major restructuring of batch management code, to allow it to work
> with the table AM interface (as well as related improvements and
> polishing).
>
> The parts of batch management that aren't under the direct control of
> table AMs/heapam (the batch helper functions that all table AMs will
> use) are no longer in indexam.c; there's a new file for those routines
> named indexbatch.c. indexbatch.c is also the place where a few other
> helper functions go. These other functions are called by indexam.c/the
> core executor, for things like initializing an amgetbatch scan, and
> informing nbtree that it is taking a mark (for mark/restore).
>
> Maybe there are certain remaining problems with the way that indexam.c
> and heapam_handler.c are coordinating across index scans. Hopefully
> the structure wasn't accidentally overfitted to heapam/isn't brittle
> in some other way.
>

+1, seems like a clear improvement. I don't see any major issues, but I
have a couple minor comments/questions.

I realize this also removes mark/restore support from the old
"amgettuple" interface, so only AMs implementing the new batching API
will be able to do mark/restore. AFAICS in core this only affects btree,
and that is switched to batching. But won't this break external AMs
that might do mark/restore, and don't want to / can't do batching? I'm
not aware of any such AMs, though. Maybe it's fine.

Do we want to make "per-leaf-page" batches explicit in the commit
message / comments? Yes, we do create per-leaf batches, but isn't it
more because it's convenient, and the AM could create larger/smaller
batches if appropriate? Or is this a requirement? I'm thinking about
doing batching for gist/spgist ordered scans, where the index entries
are not returned leaf-at-a-time.

Another question is about IOS with xs_hitup, which the patch does not
support (only IOS with xs_itup is). Is there a reason this can't be
supported? I can't think of one, but maybe I'm missing something?

There's also the question whether amgettuple/amgetbatch should be
exclusive, or an AM could support both. In the docs the patch seems to
imply it's exclusive, but then it also says "XXX uncertain" about this.

I suspect it probably makes sense to allow implementing both, with
amgettuple as a "fallback" for scans where doing batching is complex and
unlikely to help (I'm thinking about spgist ordered scans, which reorder
the index entries). But maybe it makes sense to apply batching even to
this case, not sure.

If we choose to allow both, then something will have to decide which of
the APIs to use in a given scan. This decision probably needs to happen
early in planning, and can't be renegotiated. Ultimately the only part
aware of all the details is the AM opclass, so there'd need to be some
sort of optional AM procedure to decide this.

But maybe it's not worth it? I'm concerned about painting ourselves into
a corner, where some index AM can't do batching for one corner case,
and therefore can't do batching at all.

> * Renamed and made lots of tweaks to batching related functions and
> structs. I've also integrated code that previously appeared in its own
> "batch cache" patch into the new main commit in the patch series (the
> first patch in the new series).
>
> The main goal of the tweaks to the data structures was to avoid
> indirection that previously caused small regressions in my
> microbenchmarks. We're very sensitive to costs from additional pointer
> chasing in these code paths. And from even small memory allocations.
>
> I think that I've avoided all regressions with just the first patch,
> at least for my own microbenchmark suite. I did not aim to avoid these
> regressions with the prefetching patch, since I consider it out of
> scope now (for Postgres 19).
>

I think it's much more consistent now, thanks.

I find the new "BatchIndexScan" name a bit confusing, it sounds more
like a special type of IndexScan. Maybe IndexScanBatch would be better?

Also, I find the batch_assert_pos_valid/batch_assert_batch_valid naming
a bit surprising. I think the custom is to name "assert" functions
something like AssertSomethingSomething(), to make them distinct from
regular functions. At least that's what I've seen in other patches, and
I followed that practice ... But maybe it's not suitable for non-static
functions.

Speaking of batch_assert_batches_valid, why not add it to relscan.h,
next to the other "asserts"?

> * v1 breaks out prefetching into its own patch, which is now the
> second patch in the patch series.
>
> The new I/O prefetching patch turned out to be surprisingly small. I
> still feel good about our choice to put that off until Postgres 20,
> though -- it's definitely where most of the difficulties are.
> Especially with things like resource management. (The problem with the
> second patch is that it's too small/doesn't address all the problems,
> not that it's too big and unwieldy.)
>

Let's see how that goes. I'm not against putting that off until Postgres
20, but maybe it's too early to make that decision. I'd like to at least
give it a try for 19. If it doesn't make it, that's fine.

> Prefetching works at least as well as it did in earlier versions
> (maybe even slightly better). It's not just an afterthought here. At a
> minimum, we need to continue to maintain prefetching in a reasonably
> complete and usable form to keep us honest about the design changes in
> the table AM and index AM APIs. If the design itself cannot eventually
> accommodate Postgres 20 work on I/O prefetching (and even later work),
> then it's no good.
>
> Minor caveat about preserving prefetching in good working order: I
> disabled support for index-only scans that use I/O prefetching for
> heap accesses in the second patch, at least for now. To recap, IoS
> support requires a visibility cache so that both readBatch and
> streamBatch agree on exactly which heap blocks will need to be read,
> even when the visibility map has some relevant heap page bits
> concurrently set or unset. It won't be too hard to add something like
> that back to heapam_handler.c, but I didn't get around to doing so
> just yet.
>

Right, I was wondering how the patch was dealing with that before I
realized it's disabled.

> It might be independently useful to have some kind of visibility
> cache, even without prefetching; batching VM accesses (say by doing
> them up front, for a whole batch, right after amgetbatch returns)
> might work out saving cycles with cached scans. You know, somewhat
> like how we'll do same-heap-page heap tuple fetches eagerly as a way
> of minimizing buffer lock/unlock traffic.
>

True.

> * There's a new patch that adds amgetbatch support for hash indexes.
>
> This demonstrates that the amgetbatch interface is already reasonably
> general. And that adding support to an index AM doesn't have to be all
> that invasive. I'm more focussed than ever on the generality of the
> API now.
>

Nice, and surprisingly small.

> * Added documentation that attempts to formalize the constraints that
> index AMs that opt to use amgetbatch are under.
>
> I don't think that it makes sense to think of amgettuple as the legacy
> interface for plain index scans. There will probably always be cases
> like KNN GiST scans, that legitimately need the index AM to directly
> control the progress of index scans, a tuple at a time.
>

Right.

> After all, these scan types give an absurd amount of control over many
> things to the index AM -- that seems to really make it hard to put the
> table AM in control of the scan's progress. For example, GiST scans
> use their own GISTSearchHeapItem struct to manage each item returned
> to the scan (which has a bunch of extra fields compared to our new
> AM-generic BatchMatchingItem struct). This GISTSearchHeapItem struct
> allows GiST to indicate whether or not an index tuple's quals must be
> rechecked. It works at the tuple granularity (individual GiST
> opclasses might expect that level of flexibility), which really works
> against the batching concepts that we're pursuing here.
>
> It's true that hash index scans are also lossy, but that's quite
> different: they're inherently lossy. It's not as if hash index scans
> are sometimes not lossy. They certainly cannot be lossy for some
> tuples but not other tuples that all get returned during the same
> index scan. Not so with GiST scans.
>
> Likely the best solution to the problems posed by GiST and SP-GiST
> will be to choose one of either amgettuple and amgetbatch during
> planning, according to what the scan actually requires (while having
> support for both interfaces in both index AMs). I'm still not sure
> what that should look like, though -- how does the planner know which
> interface to use, in a world where it has to make a choice with those
> index AMs that offer both? Obviously the answer depends in part on
> what actually matters to GiST/where GiST *can* reasonably use
> amgetbatch, to get benefits such as prefetching. And I don't claim to
> have a full understanding of that right now.
>

Right, that's pretty much what I suggested earlier.

> Here are the things that I'd like to ask from reviewers, and from Tomas:
>
> * Review of the table AM changes, with a particular emphasis on high
> level architectural choices.
>

To me the proposed architecture/layering looks nice and reasonable.

But I hadn't really thought about the table AM until Andres pointed out
the issues, so I may not be the right person to judge this. And I've
moved the code between the various layers so many times I have vertigo.

> * Most importantly: will the approach in this new v1 avoid painting
> ourselves into a corner? It can be incomplete, as long as it doesn't
> block progress on things we're likely to want to do in the next couple
> of releases.
>

I don't see why we would paint ourselves into a corner with this. (I'm
ignoring the question about allowing only one of amgettuple/amgetbatch.)

> * Help with putting the contract that amgetbatch requires of index AMs
> on a more rigorous footing. In other words, is amgetbatch itself
> sufficiently general to accommodate the needs of index AMs in the
> future? I've made a start on that here (by adding sgml docs about the
> index AM API, which mentions table AM concerns), but work remains,
> particularly when it comes to supporting GiST + SP-GiST.
>

So what exactly is the "contract" assumed by the current patch? Do you
have any thoughts about it being too inflexible in some respect?

As mentioned earlier, maybe we shouldn't tie batches to leaf pages too
much, so that an AM can build batches that are not aligned to leaf
pages in such a simple way. I think this would allow batching and
prefetching for cases like the spgist ordered scans, etc.

> I think it makes sense to keep feedback mostly high level for now --
> to make it primarily about how the individual API changes fit
> together, if they're coordinating too much (or not enough), and if the
> interface we have is able to accommodate future needs.
>

Makes sense. Hopefully I wasn't nitpicking about details too much.

regards

--
Tomas Vondra
