Re: index prefetching

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-11-14 18:19:08
Message-ID: CAH2-Wz=PR6rUniOwyeo+BcR62fVZU5UR4ci0UH4kZEuaY_td8A@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 12, 2025 at 12:39 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> I think I generally agree with what you said here about the challenges,
> although it's a bit too abstract to respond to individual parts. I just
> don't know how to rework the design to resolve this ...

I'm trying to identify which subsets of the existing design can
reasonably be committed in a single release (while acknowledging that
even those subsets will need to be reworked). That is more abstract
than any of us would like -- no question.

What are we most confident will definitely be useful to prefetching,
that also enables the "only lock heap buffer once per group of TIDs
that point to the same heap page returned from an index scan"
optimization? I'm trying to reach a tentative agreement that just
doing the amgetbatch revisions and the table AM revisions (to do the
other heap buffer lock optimization) will represent useful progress
that can be committed in a single release. And on what the specifics
of the table AM revisions will need to be, to get us to a patch that
we can commit to Postgres 19.

> For the reads stream "pausing" I think it's pretty clear it's more a
> workaround than a desired behavior. We only pause the stream because we
> need to limit the look-ahead distance (measured in index leaf pages),
> and the read_stream has no such concept. It only knows about heap pins,
> but e.g. IOS may need to read many leaf pages to find a single heap page
> to prefetch. And the leaf pages are invisible to the stream.

Right. But we seemed to talk about this as if the implementation of
"pausing" was the problem. I was suggesting that the general idea of
pausing might well be the wrong one -- at least when applied in
anything like the way we currently apply it.

More importantly, I feel that it'll be really hard to get a clear
answer to that particular question (and a couple of others like it)
without first getting clarity on what we need from the table AM at a
high level, API-wise. Bearing in mind that we've made no real progress
on that at all.

We all agree that it's bad that indexam.c tacitly coordinates with
heapam in the way it does in the current patch. And that assuming a
TID representation in the API is bad. But that isn't very satisfying
to me; it's too focussed on that one really obvious and glaring
problem, and what we *don't* want. There's been very little (almost
nothing) on this thread about what we actually *do* want. That's the
thing that's still way too abstract, that I'd like to make more
concrete.

As you know, I think that we should add a new table AM interface that
makes the table AM directly aware of the fact that it is feeding an
ordered index scan, completely avoiding the use of TIDs (as well as
avoiding *any* more abstract representation of a table AM tuple
identifier). In other words, I think that we should just fully admit
the fact that the table AM is in control of the scan, and all that
comes with it. The table AM will have to directly coordinate with the
index AM in a way that's quite different to what we do right now.

I don't think that anybody else has really said much about that idea,
at least on the list. Is it a reasonable approach to take? This is
really important, especially in the short term/for Postgres 19.

> The limit of 64 batches is entirely arbitrary. I needed a number that
> would limit the amount of memory and time wasted on useless look-ahead,
> and 64 seemed "reasonable" (not too high, but enough to not be hit very
> often). Originally there was a fixed-length queue of batches, and 64 was
> the capacity, but we no longer do it that way. So it's an imperfect
> safety measure against "runaway" streams.

Right, but we still max out at 64. And then we stay there. It just
feels unprincipled to me.

> I don't want to get into too much detail about this particular issue,
> it's already discussed somewhere in this thread. But if there was a way
> to "tell" the read stream how much effort to spend looking ahead, we
> wouldn't do the pausing (not in the end+reset way).

I don't want to get into that again either. It was just an example of
the kinds of problems we're running into. Though a particularly good
example IMV.

> > That's all I have for now. My thoughts here should be considered
> > tentative; I want to put my thinking on a more rigorous footing before
> > really committing to this new phased approach.
> >
>
> I don't object to the "phased approach" with doing the batching first,
> but without seeing the code I can't really say if/how much it helps with
> resolving the design/layering questions. It feels a bit too abstract to
> me.

It is in no small part based on gut feeling and intuition. I don't
have anything better to go on right now. It's a really difficult
project.

> While working on the prefetching I moved the code between layers
> about three times, and I'm still not quite sure which layer should be
> responsible for which piece :-(

I don't think that this is quite the same situation.

The index prefetching design has been completely overhauled twice now, but
on both occasions that was driven by some clear goal/the need to fix
some problem with the prior design. The first time it was due to the
fact that the original version didn't work with kill_prior_tuple. The
second time was due to the need to support reading index pages that
were ahead of the current page that the scan is returning tuples from.
Granted, it took a while to actually prove that the second overhaul
(which created the third major redesign) was the right direction to
take things in, but testing did eventually make that quite clear.

I don't see this as doing the same thing a third time/creating a fourth
design from scratch. It's more of a refinement (albeit quite a big
one) of the most recent design. And in a direction that doesn't seem
too surprising to me. We knew that the table AM side of the most
recent redesign still had plenty of problems. We should have been a
bit more focussed on that side of things earlier on.

--
Peter Geoghegan
