Re: index prefetching

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-08-14 23:21:15
Message-ID: 602e0029-1619-4288-acec-8a38e6fb9e6a@vondra.me
Lists: pgsql-hackers

On 8/14/25 23:55, Peter Geoghegan wrote:
> On Thu, Aug 14, 2025 at 5:06 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> If this same mechanism remembered (say) the last 2 heap blocks it
>> requested, that might be enough to totally fix this particular
>> problem. This isn't a serious proposal, but it'll be simple enough to
>> implement. Hopefully when I do that (which I plan to soon) it'll fully
>> validate your theory.
>
> I spoke too soon. It isn't going to be so easy, since
> heapam_index_fetch_tuple wants to consume buffers as a simple stream.
> There's no way that index_scan_stream_read_next can just suppress
> duplicate block number requests (in a way that's more sophisticated
> than the current trivial approach that stores the very last block
> number in IndexScanBatchState.lastBlock) without it breaking the whole
> concept of a stream of buffers.
>

I believe this idea (checking not just the very last block, but keeping
a bit longer history) was briefly discussed a couple months ago, after
you pointed out the need for the "last block" optimization (which the
patch didn't have). At that point we were focused on addressing a
regression with correlated indexes, so the single block was enough.

But as you point out, it's harder than it seems. If I recall correctly,
the challenge is that heapam_index_fetch_tuple() is expected to release
the buffer when the block changes, but then how would it know there's no
future read of the same buffer coming later in the stream?

>>> We can optimize that by deferring the StartBufferIO() if we're encountering a
>>> buffer that is undergoing IO, at the cost of some complexity. I'm not sure
>>> real-world queries will often encounter the pattern of the same block being
>>> read in by a read stream multiple times in close proximity sufficiently often
>>> to make that worth it.
>>
>> We definitely need to be prepared for duplicate prefetch requests in
>> the context of index scans.
>
> Can you (or anybody else) think of a quick and dirty way of working
> around the problem on the read stream side? I would like to prioritize
> getting the patch into a state where its overall performance profile
> "feels right". From there we can iterate on fixing the underlying
> issues in more principled ways.
>
> FWIW it wouldn't be that hard to require the callback (in our case
> index_scan_stream_read_next) to explicitly point out that it knows
> that the block number it's requesting has to be a duplicate. It might
> make sense to at least place that much of the burden on the
> callback/client side.
>

I don't recall all the details, but IIRC my impression was that it'd be
best to do this "caching" entirely in read_stream.c (so the next_block
callbacks would probably not need to worry about lastBlock at all),
enabled when creating the stream. And then there would be something like
read_stream_release_buffer() that'd do the right thing to release the
buffer when it's no longer needed.

regards

--
Tomas Vondra
