Re: index prefetching

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-07-18 20:52:29
Message-ID: f2m3vqsnbnnmybnn3vyjy3t3t3ayw5yyatcbohs3adlzfxzsy2@wt32h37uyvpf
Lists: pgsql-hackers

Hi,

On 2025-07-18 19:44:51 +0200, Tomas Vondra wrote:
> I agree tableam needs to have a say in this, so that it can interpret
> the TIDs in a way that fits how it actually stores data. But I'm not
> sure it should be responsible for calling index_batch_getnext(). Isn't
> the batching mostly an "implementation" detail of the index AM? That's
> how I was thinking about it, at least.

I don't agree with that. For efficiency reasons alone, table AMs should get a
whole batch of TIDs at once. If you have an ordered indexscan that returns
TIDs correlated with the table's physical order, we waste a *tremendous*
amount of cycles right now.

Instead of locking the page once, doing a HOT search for each of its tuples,
and then unlocking it, we lock and unlock the page for every single TID. The
locking alone is a significant overhead (it's something like 25% of the
cycles), but what's worse, it reduces what out-of-order execution can do to
hide cache misses.

Even leaving locking overhead and out-of-order execution aside, there's a good
bit of constant overhead work in heap_hot_search_buffer() that can be avoided
by doing the work all at once.
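
To make that concrete, here's a rough sketch of the kind of per-page
amortization that becomes possible once the table AM sees a whole batch of
TIDs. This is illustrative only, not the actual patch; "tids", "ntids",
"found", "heaptuples" etc. are invented names for the example, while the
bufmgr calls and heap_hot_search_buffer() are the existing interfaces:

    /*
     * Sketch: keep the heap page locked while consuming all consecutive
     * TIDs that fall on it, instead of locking/unlocking once per TID.
     */
    Buffer      buf = InvalidBuffer;
    BlockNumber curblk = InvalidBlockNumber;
    bool        all_dead;

    for (int i = 0; i < ntids; i++)
    {
        BlockNumber blk = ItemPointerGetBlockNumber(&tids[i]);

        if (blk != curblk)
        {
            if (BufferIsValid(buf))
                UnlockReleaseBuffer(buf);
            buf = ReadBuffer(heapRel, blk);
            LockBuffer(buf, BUFFER_LOCK_SHARE);
            curblk = blk;
        }

        /* HOT-chain search while the shared lock is still held */
        found[i] = heap_hot_search_buffer(&tids[i], heapRel, buf, snapshot,
                                          &heaptuples[i], &all_dead, true);
    }

    if (BufferIsValid(buf))
        UnlockReleaseBuffer(buf);

The point is simply that once the TIDs arrive at the table AM boundary in
batches, this kind of amortization becomes possible at all.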

Just to show how big that effect is, I hacked up a patch that holds the buffer
lock from when the buffer is first pinned in heapam_index_fetch_tuple() until
another buffer is pinned, or until the scan ends. That's totally not a valid
change due to holding the lock for far too long, but it's a decent
approximation of the gain from reducing the locking. This query:

  SELECT * FROM lineitem ORDER BY l_orderkey OFFSET 10000000 LIMIT 1;

speeds up by 28%. Of course that's an extreme case, but still.

That likely undersells the gain, because the out-of-order benefits aren't
really there due to all the other code that runs in between two
heap_hot_search_buffer() calls. It obviously also doesn't show any of the
amortization benefits.

IMO the flow really should be something like this:

IndexScan executor node
-> table "index" scan using the passed in IndexScanDesc
-> read stream doing readahead for all the required heap blocks
-> table AM next page callback
-> index scans returning batches
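
To illustrate the shape of that (not the patch's actual interface;
index_batch_getnext()'s real signature may well differ, and the struct and
helper names below are invented), the table AM side could boil down to a
read stream block-number callback that drains TIDs from the current index
batch:

    typedef struct HeapIndexScanState      /* invented for this sketch */
    {
        IndexScanDesc   scan;
        ItemPointerData batch_tids[64];
        int             batch_ntids;
        int             batch_pos;
    } HeapIndexScanState;

    static BlockNumber
    heap_indexscan_next_block(ReadStream *stream,
                              void *callback_private_data,
                              void *per_buffer_data)
    {
        HeapIndexScanState *hscan = callback_private_data;

        /* refill from the index AM when the current batch is drained */
        if (hscan->batch_pos >= hscan->batch_ntids)
        {
            /* hypothetical batch interface, see above */
            hscan->batch_ntids = index_batch_getnext(hscan->scan,
                                                     hscan->batch_tids,
                                                     lengthof(hscan->batch_tids));
            hscan->batch_pos = 0;
            if (hscan->batch_ntids == 0)
                return InvalidBlockNumber;  /* end of scan */
        }

        return ItemPointerGetBlockNumber(&hscan->batch_tids[hscan->batch_pos++]);
    }

    /* at scan start, roughly: */
    stream = read_stream_begin_relation(READ_STREAM_DEFAULT, NULL, heapRel,
                                        MAIN_FORKNUM,
                                        heap_indexscan_next_block,
                                        hscan, 0);

The executor node then only talks to the table AM, the table AM consumes
pinned buffers via read_stream_next_buffer(), and the index AM is only ever
asked for batches; the readahead distance is managed entirely by the read
stream.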

I think the way that IndexOnlyScan works today (independent of this patch)
really is a layering violation. It "knows" about the visibilitymap, which it
really has no business accessing; that's a heap specific thing. It also knows
too much about the different formats that can be stored by indexes, but
that's kind of a separate issue.
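
Concretely, the executor node itself consults the heap's visibility map to
decide whether it even needs to fetch the heap tuple. Roughly this shape
(from memory, simplified; the exact code in nodeIndexonlyscan.c differs in
detail):

    if (!VM_ALL_VISIBLE(scandesc->heapRelation,
                        ItemPointerGetBlockNumber(tid),
                        &node->ioss_VMBuffer))
    {
        /* not all-visible: the heap has to be visited for this TID */
    }

That check only makes sense for heap-like AMs, yet it lives in the executor.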

Greetings,

Andres Freund
