Re: index prefetching

From: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(at)vondra(dot)me>, Alexandre Felipe <o(dot)alexandre(dot)felipe(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2026-03-29 06:15:55
Message-ID: CAHg+QDdsOe_7wyN_fdeKQmGkCTGkhnkSUCSNGOFWQZZTO0CdUQ@mail.gmail.com
Lists: pgsql-hackers

Hi Peter,

On Fri, Mar 27, 2026 at 5:36 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:

> On Sun, Mar 22, 2026 at 9:14 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > V17 is attached. This addresses most of your feedback, but defers
> > dealing with the trickier parts for now.
>
> Attached is v18, which addresses the tricky issues I skipped in v17.
>
> Highlights:
>
> * New commit (the first) that adds a heapam_indexscan.c file, and
> moves all of the relevant existing heapam_handler.c functions over to
> this new file. This commit is strictly mechanical, and has no
> functional impact.
>
> * Another new commit (the second) adds the new slot-based index scan
> interface, and pushes the VM lookups that currently take place in
> nodeIndexonlyScan.c into heapam. This includes the instrumentation
> changes that were previously in their own small patch. It also
> includes the VISITED_PAGES_LIMIT stuff -- there's no longer any commit
> that breaks VISITED_PAGES_LIMIT, nor is there a later commit that
> subsequently restores the functionality.
>
> * We now extract one key piece of infrastructure that other table AMs
> can reuse to manage their own index scan batches: a new inline
> function, tableam_util_fetch_next_batch, which appears in
> indexbatch.h.
>
> This function (which appeared under a different name in v17) manages
> deciding whether to call amgetbatch (performing the amgetbatch call
> when needed) or to simply return the next batch in line (because such
> a batch already exists/is already loaded in the ring buffer from
> earlier). It also detects and handles cross-batch changes in scan
> direction, and sets/reads the fields that track when a batch is
> definitely the last one in the given scan direction.
>
> I tried to go further by also placing the bulk of the code in
> heapam_index_getnext_scanbatch_pos in indexbatch.h, but ultimately
> concluded that it wasn't worth it. There isn't much code in
> heapam_index_getnext_scanbatch_pos to generalize. Some of that code
> coordinates quite directly with the heapam read stream callback.
>
> * Added an "isGuarded" field to IndexScanBatchData. That way we can
> assert that any batch loaded within the read stream callback is
> already unguarded/has no batch index page buffer pin still held. The
> field isn't just for assertions, though. We also rely on it in one or
> two places during release builds, which seemed clearer to me. For
> example, we no longer require that amunguardbatch routines be
> idempotent, which seems like a minor improvement.
>
> * Other work is broken out into its own commit, to better highlight
> issues that aren't strictly related to the amgetbatch changes.
> Specifically, a new commit adds the "don't unpin on rescan" behavior.
> These changes, which first appeared in the main amgetbatch commit, aim
> to minimize regressions with cached workloads (this was discussed
> briefly upthread, plus Andres and I discussed it over IM some weeks
> back).
>
> * Replaced what was previously an assert with an equivalent
> pg_assume(), within heapam_index_getnext_scanbatch_pos -- which is a
> very hot function. We now "pg_assume(batchringbuf->headBatch ==
> scanPos->batch)", allowing the compiler to exploit this existing
> invariant. This seems to help the compiler generate better code,
> presumably by allowing more efficient register allocation. I haven't
> looked at the disassembly just yet, but it helps with certain
> microbenchmarks enough to matter.
>
> * The "Add UnlockBufferGetLSN utility function" commit now uses an
> atomic operation (in the common case where it actually drops the index
> page's pin), just like UnlockReleaseBuffer itself following Andres'
> commit f39cb8c011062d65e146c1e9d1aae221e96d8320 from today.
>
> I completed this item quickly, after pulling from master a little
> while ago. The expectation for UnlockBufferGetLSN is that it be at
> least as good as a BufferGetLSNAtomic followed by a
> UnlockReleaseBuffer, so IMV it made sense to include this last minute
> addition in v18.
>
> * The "don't wait" patches are no longer included, since Andres
> committed them to the master branch just now. Thanks to both you
> (Andres) and Melanie for taking care of this very thorny issue!
>
> I believe I have now acted on all of Andres' feedback, with one minor
> exception: Andres disliked how we use 'if (scan->xs_heapfetch)' to
> determine whether to use the scan's batch cache (we don't want to use
> it at all if the scan is ending). I just ran out of steam, so this
> version doesn't address the problem. But I'm still tracking it as an
> open item (I don't disagree with Andres, but this problem is slightly
> awkward). Note that I *did* create a helper function here, so we at
> least avoid having 2 copies of the loop that looks for a free cache
> batch slot.
>

I am still reviewing/understanding the patch; here are a couple of quick
comments based on my review so far.

1. There looks to be an off-by-one error in the for loop in the
patch v18-0015-WIP-aio-io_uring-Use-IO-size-not-IO-queue-to-tri.patch:

@@ -730,7 +711,39 @@ pgaio_uring_sq_from_io(PgAioHandle *ioh, struct io_uring_sqe *sqe)
ioh->op_data.read.iov_length,
ioh->op_data.read.offset);
+ for (int i = 0; i <= ioh->op_data.read.iov_length; i++, iov++)
+ io_size += iov->iov_len;

2. Nit: _bt_endpoint's return type changed from bool to IndexScanBatch
in the patch
v18-0005-Add-interfaces-that-enable-index-prefetching.patch, but it
would be good to replace the places where it still returns false with
NULL. Although false is treated as 0/NULL and produces no compiler
errors, using NULL improves the readability of the code.

if (!BufferIsValid(btfirstbatch->buf))
{
/*
* Empty index. Lock the whole relation, as nothing finer to lock
* exists.
*/
PredicateLockRelation(rel, scan->xs_snapshot);
_bt_parallel_done(scan);
return false;
}

Thanks,
Satya
