Re: index prefetching

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-07-18 21:44:26
Message-ID: CAH2-Wzk4-sfr35nJJahErj=tZqucBHaxQEOyvNjxjQ0MmF73Yw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 18, 2025 at 4:52 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> I don't agree with that. For efficiency reasons alone table AMs should get a
> whole batch of TIDs at once. If you have an ordered indexscan that returns
> TIDs that are correlated with the table, we waste a *tremendous* amount of
> cycles right now.

I agree, I think. But the terminology in this area can be confusing,
so let's make sure that we all understand each other:

I think that the table AM probably needs to have its own definition of
a batch (or some other distinct phrase/concept) -- it's not
necessarily the same group of TIDs that are associated with a batch on
the index AM side. (Within an index AM, there is a 1:1 correspondence
between batches and leaf pages, and batches need to hold on to a leaf
page buffer pin for a time. None of this should really matter to the
table AM.)

At a high level, the table AM (and/or its read stream) asks for so
many heap blocks/TIDs. Occasionally, index AM implementation details
(i.e. the fact that many index leaf pages have to be read to get very
few TIDs) will result in that request not being honored. The interface
that the table AM uses must therefore occasionally answer "I'm sorry,
I can only reasonably give you so many TIDs at this time". When that
happens, the table AM has to make do. That can be very temporary, or
it can happen again and again, depending on implementation details
known only to the index AM side (though typically it'll never happen
even once).
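
To make the "ask for N, sometimes get fewer" contract concrete, here is a
minimal C sketch. Every name in it (SketchTid, SketchTidSource,
sketch_get_tids, max_leaf_reads) is hypothetical; none of it comes from
PostgreSQL or from either patch. It assumes the simplest possible budget,
a cap on how many leaf pages one request may step past, and it fabricates
TIDs rather than reading index tuples.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a heap TID; not PostgreSQL's ItemPointerData. */
typedef struct SketchTid
{
    unsigned    block;
    unsigned    offset;
} SketchTid;

/* Hypothetical index-side state, owned by the layer in front of the index AM. */
typedef struct SketchTidSource
{
    const int  *tids_per_leaf;  /* matching TIDs found on each leaf page */
    size_t      nleaves;        /* number of leaf pages in the scan */
    size_t      cur_leaf;       /* leaf page currently being consumed */
    int         consumed;       /* TIDs already returned from cur_leaf */
    int         max_leaf_reads; /* leaf pages one request may step past */
} SketchTidSource;

/*
 * Ask for up to "wanted" TIDs.  Returns how many were stored in out[].
 * The result can be smaller than "wanted" (even zero) when the leaf page
 * budget runs out first, or when the scan has no more leaf pages.
 */
static int
sketch_get_tids(SketchTidSource *src, int wanted, SketchTid *out)
{
    int         ntids = 0;
    int         leaf_reads = 0;

    while (ntids < wanted && src->cur_leaf < src->nleaves)
    {
        if (src->consumed >= src->tids_per_leaf[src->cur_leaf])
        {
            /* Current leaf is used up; moving on costs one leaf read. */
            if (leaf_reads >= src->max_leaf_reads)
                break;          /* "that's all I can give you for now" */
            leaf_reads++;
            src->cur_leaf++;
            src->consumed = 0;
            continue;
        }

        /* Fabricated TID: leaf number as block, position as offset. */
        out[ntids].block = (unsigned) src->cur_leaf;
        out[ntids].offset = (unsigned) src->consumed + 1;
        ntids++;
        src->consumed++;
    }
    return ntids;
}

/* True once every leaf page has been stepped past. */
static bool
sketch_scan_done(const SketchTidSource *src)
{
    return src->cur_leaf >= src->nleaves;
}
```

The caller tells a short answer apart from end-of-scan by checking
sketch_scan_done(); a short (or empty) answer with the scan not done just
means "ask again". Whether a real interface would separate those two cases
this way is an open design choice, not something the sketch settles.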

Does that sound roughly right to you? Obviously these details are
still somewhat hand-wavy -- I'm not fully sure of what the interface
should look like, by any means. But the important points are:

* The table AM drives the whole process.

* The table AM knows essentially nothing about leaf pages/index AM
batches -- it just has some general idea that sometimes it cannot have
its request honored, in which case it must make do.

* Some other layer represents the index AM -- though that layer
actually lives outside of index AMs (this is the code that the
"complex" patch currently puts in indexam.c). This other layer manages
resources (primarily leaf page buffer pins) on behalf of each index
AM. It also determines whether or not index AM implementation details
make it impractical to give the table AM exactly what it asked for
(this might actually require a small amount of cooperation from index
AM code, based on simple generic measures like leaf pages read).

* This other index AM layer does still know that it isn't cool to drop
leaf page buffer pins before we're done reading the corresponding heap
TIDs, due to heapam implementation details around making concurrent
heap TID recycling safe.

I'm not really sure how the table AM lets the new index AM layer know
"okay, done with all those TIDs now" in a way that is both correct (in
terms of avoiding unsafe concurrent TID recycling) and also gives the
table AM the freedom to do its own kind of batch access at the level
of heap pages. We don't necessarily have to figure all that out in the
first committed version, though.
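
Purely to illustrate the invariant (no pin release while heap TIDs from
that leaf page are still in flight), the most naive handshake I can think
of is a per-batch count of outstanding TIDs. The C sketch below assumes
exactly that. All names (SketchBatch, sketch_batch_hand_out,
sketch_batch_tids_done) are hypothetical, the "pinned" flag is only a
stand-in for a real buffer pin, and the sketch does not address how the
table AM would combine this with its own heap-page-level batching, which
is the actual open question.

```c
#include <stdbool.h>

/* Hypothetical per-leaf-page batch, tracked by the layer outside the index AM. */
typedef struct SketchBatch
{
    int         outstanding;    /* TIDs handed out, not yet reported done */
    bool        all_handed_out; /* no more TIDs will come from this leaf */
    bool        pinned;         /* stand-in for holding the leaf buffer pin */
} SketchBatch;

/* Drop the (pretend) pin once nothing from this leaf is still in flight. */
static void
sketch_batch_maybe_release(SketchBatch *b)
{
    if (b->all_handed_out && b->outstanding == 0)
        b->pinned = false;      /* real code would release the buffer pin */
}

/* Start a batch: the leaf page has been read and is pinned. */
static void
sketch_batch_init(SketchBatch *b)
{
    b->outstanding = 0;
    b->all_handed_out = false;
    b->pinned = true;
}

/* The index-side layer gives "ntids" TIDs from this leaf to the table AM. */
static void
sketch_batch_hand_out(SketchBatch *b, int ntids, bool last)
{
    b->outstanding += ntids;
    if (last)
        b->all_handed_out = true;
    sketch_batch_maybe_release(b);
}

/* The table AM reports that it is finished with "ntids" of those TIDs. */
static void
sketch_batch_tids_done(SketchBatch *b, int ntids)
{
    b->outstanding -= ntids;
    sketch_batch_maybe_release(b);
}
```

The pin survives as long as either more TIDs may still be handed out from
the leaf page or some handed-out TIDs have not been reported done.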

--
Peter Geoghegan
