Re: index prefetching

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Alexandre Felipe <o(dot)alexandre(dot)felipe(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2026-04-06 05:47:43
Message-ID: CAH2-Wzn=vwK+9mv5HuNfuaPkhT+0Ou-E1_Y8miGNJZm-WxHORg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Apr 5, 2026 at 7:01 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > It's unclear if you're asking (or at least suggesting) that I should
> > invent a new table_index_getnext_slot-like shim, used only during
> > index-only scans, that doesn't require passing a slot (but does
> > require getting a table slot some other way, likely at scan startup).
>
> Why would it require creating a tableam shaped slot internally? It seems
> rather silly to store a tuple fetched to verify visibility into a slot, where
> it is never used. Storing in a tuple requires an indirect things like
> IncrBufferRefCount(), which isn't that cheap.
>
> heapam_index_getnext_slot(index_only=>1)
> -> heapam_index_fetch_heap_item()
> -> heapam_index_fetch_tuple()

The fact is that nodeIndexOnlyscan.c calls index_fetch_heap right now,
on master, which uses a TupleTableSlot.

Yes, using a slot is suboptimal. But it's nothing new. And this is the
first time I'm hearing about this being a problem. A couple of days
before feature freeze.

> So I guess yes, I am suggesting that we have a separate
> table_index_getnext_ios() or such, which is only usable for index only scans.

What if we offered such an interface but still used a slot internally?
That way we'd have the API you're asking for, without any significant
changes to heap_hot_search_buffer. The caller doesn't need to know
about the table slot's usage.

I get that making the caller use a slot interface with a slot that
they don't exactly need is a bit hokey. We can fix that part, without
seriously risking not getting the patch into 19.

> > Actually, I don't know if we really *should* treat the selfuncs.c
> > caller any different (except as needed to avoid spurious assertion
> > failures).

> I don't think it's worth assuming that that's the code in the core
> infrstructure.

To be clear, I agree.

> scan->heapRelation->rd_tableam->index_fetch_batch_init(scan, batch, new_alloc);
>
> that's a lot of indirection to go through.
>
>
> Which is what then made me think of embedding something like a
> TableIndexFetchRoutine (with batch_init, reset, markpos, restrpos, end
> members) in IndexScanDesc. That way table_index_{fetch_reset,batch_Init,...}()
> wouldn't need this level of indirection.

But that's the same level of indirection as before? Except it's once
per batch instead of once per tuple.

> Not having the same code in multiple table AMs would make it easier to write
> them, which would be nice in itself, but I am more concerned with ending up
> with copies of heapam_index_getnext_scanbatch_pos() in various AMs, us finding
> problems with heapam_index_getnext_scanbatch_pos() in a minor release, and
> having a harder time adjusting things due to the copied code.

I'll need to study this further.

> > What do you want to do about it? index_getnext_tid isn't adding too
> > much right now, I'd say.
>
> It does indeed not add a whole lot.

> I would move it into a tableam_util_fetch_next_tid() (or such) helper, I
> think. That also makes it easier to get rid of the table_index_fetch_reset()
> and move the pgstat_count_index_tuples(scan->indexRelation, 1) into one place.

I'll put it into indexbatch.h (next to the function that calls
amgetbatch), and make sure that the amgettuple path is consistent in
terms of how it calls pgstat_count_index_tuples(), and how the
fetch/index scan gets reset.

I don't think that we should be resetting here -- even if it's the
historical behavior. I'm absolutely sure that whatever we do shouldn't
vary based on whether amgetbatch or amgettuple was used (that was not
intended).

> > > Can'tt the *2 lead to computing a xs_vm_items bigger than scan->maxitemsbatch?
> >
> > Yes, it can.
>
> But we prevent it from overflow by a Max()/Min() in the loop headers.

Right.

> > > I'd set scan->batchcache[i] to NULL after freeing.
> >
> > I tried that out, but I *think* it regressed pgbench SELECT by about
> > ~1.5% of total TPS (I'd need more thorough testing to confirm this).
> > Are you sure we need this?
>
> That seems surprising. But no, I don't think we need this.

I'll revisit, might have been wrong.

> It's perhaps ok, I just was a bit surprised to see the callback inside a
> function with that name. I'd perhaps just name it
> tableam_util_release_batch() (and rename batch_free accordingly).

Will do.

> I'd probably just assert !kill_prior_tuple.

Will do.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Xuneng Zhou 2026-04-06 06:01:11 Re: Implement waiting for wal lsn replay: reloaded
Previous Message Etsuro Fujita 2026-04-06 05:32:55 Re: Asynchronous MergeAppend