From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-09-15 15:12:16
Message-ID: CAH2-Wzk588ESCcigCURKbOHiyFEvTX-2CdyUi+Nv5RCYJ5Z15Q@mail.gmail.com
Lists: pgsql-hackers
On Mon, Sep 15, 2025 at 9:00 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> Yeah, this heuristic seems very effective in eliminating the regression
> (at least judging by the test results I've seen so far). Two or three
> questions bother me about it, though:
I more or less agree with all of your concerns about the
INDEX_SCAN_MIN_TUPLE_DISTANCE optimization.
> We're prefetching too close ahead, so close that the I/O can't possibly
> complete, and the overhead of submitting the I/O using AIO is higher
> than what async "saves".
>
> That's great, but is the distance a good measure of that?
The big underlying problem is that the INDEX_SCAN_MIN_TUPLE_DISTANCE
heuristic was developed in a way that over-relied on brute-force
testing. It is probably still flawed in some specific way that some
real query will actually run into, and that we won't be able to justify.
So at this point INDEX_SCAN_MIN_TUPLE_DISTANCE should still be
considered nothing more than a promising experiment. It's promising
because it does appear to work with a large variety of queries (we
know of no query where it doesn't basically work right now). My hope
is that we'll be able to come up with a simpler and more robust
approach. One where we fully understand the downsides.
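To make that concrete, the basic shape of the check is roughly as below.
This is an illustrative sketch only, not the patch's actual code: the
constant's value and the helper name are made up here. The idea is simply
to compare how far the tuple we're about to prefetch is ahead of the tuple
the scan is currently consuming, and to refuse to issue the I/O when that
distance is too small for async I/O to plausibly complete in time.

    #include <stdbool.h>    /* standalone sketch; the backend gets bool from c.h */

    /* value is made up for the example, not taken from the patch */
    #define INDEX_SCAN_MIN_TUPLE_DISTANCE 8

    /*
     * scan_itemno is the index tuple the executor is consuming right now;
     * read_itemno is the tuple whose heap block we're about to request.
     */
    static inline bool
    prefetch_distance_ok(int scan_itemno, int read_itemno)
    {
        int     tuple_distance = read_itemno - scan_itemno;

        /*
         * If the I/O would be requested only a handful of tuples ahead of
         * the scan, it cannot complete before the scan needs the block,
         * and the AIO submission overhead is pure cost.
         */
        return tuple_distance >= INDEX_SCAN_MIN_TUPLE_DISTANCE;
    }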
> 2) It's a one-time decision, not adaptive. We start prefetching, and
> then at some point (not too long after the scan starts) we make a
> decision whether to continue with prefetching or not. And if we disable
> it, it's disabled forever. That's fine for the synthetic data sets we
> use for testing, because those are synthetic. I'm not sure it'll work
> this well for real-world data sets where different parts of the file may
> be very different.
We'll probably need to accept some hard trade-off in this area.
In general (not just with this patch), prefetching works through trial
and error -- the errors are useful information, and useful information
isn't free. The regressions that the INDEX_SCAN_MIN_TUPLE_DISTANCE
heuristic addresses are cases where the errors seem unlikely to pay
for themselves. Let's not forget that these are not huge regressions
-- it's not as if the patch ever does completely the wrong thing
without INDEX_SCAN_MIN_TUPLE_DISTANCE. It's more like it hurts us to
be constantly on the verge of doing the right thing, but never quite
doing the right thing.
Fundamentally, we need to be willing to pay for the information that
might let us do better. We might be able to get that cost down through
some kind of targeted optimization, but it's unlikely to ever be close
to free.
> This is perfectly fine for a WIP patch, but I believe we should try to
> make this adaptive. Which probably means we need to invent a "light"
> version of read_stream that initially does sync I/O, and only switches
> to async (with all the expensive initialization) later. And then can
> switch back to sync, but is ready to maybe start prefetching again if
> the data pattern changes.
That does seem like it'd be ideal. But how are we supposed to decide
to switch back?
Right now, disabling prefetching disables the only way that we have to
notice that prefetching might be useful (which is to notice that we're
failing to keep up with our prefetch distance). Without
INDEX_SCAN_MIN_TUPLE_DISTANCE, for those queries where prefetch
distance collapses to ~2.0, we really can "decide to switch back to
prefetching". But maintaining the option of switching back costs us
too much (that's what we need INDEX_SCAN_MIN_TUPLE_DISTANCE to
manage).
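The easy half of that is simple enough to sketch: start with plain
synchronous reads, and only build the read stream once the scan has done
enough I/O to justify the setup cost. The sketch below is entirely
hypothetical -- nothing like it exists in the patch, the threshold is
made up, and counting raw reads is a stand-in for counting actual cache
misses. The hard half is the reverse transition, for the reason above.

    /* Hypothetical "light" reader, not in the patch */
    typedef enum { LIGHT_SYNC, LIGHT_ASYNC } LightReadMode;

    typedef struct LightReader
    {
        LightReadMode   mode;
        int             nreads;     /* reads issued while still in sync mode */
        ReadStream     *stream;     /* built lazily when we escalate */
    } LightReader;

    static Buffer
    light_read(LightReader *lr, Relation rel, BlockNumber blkno)
    {
        if (lr->mode == LIGHT_SYNC)
        {
            /* escalate after enough reads (threshold is made up) */
            if (++lr->nreads > 32)
                lr->mode = LIGHT_ASYNC;     /* build lr->stream before next call */
            return ReadBuffer(rel, blkno);
        }

        /* in async mode the stream's callback supplies block numbers */
        return read_stream_next_buffer(lr->stream, NULL);
    }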
> 3) Now that I look at the code in index_scan_stream_read_next, it feels
> a bit weird that we make the decision based on the "immediate" distance only. I
> suspect this may make it quite fragile, in the sense that even a small
> local irregularity in the data may result in different "step" changes.
> Wouldn't it be better to base this on some "average" distance?
>
> In other words, I'm afraid (2) and (3) are pretty much a "performance
> cliff", where a tiny difference in the input can result in wildly
> different behavior.
You can say the same thing about hash join spilling. It might not be
practical to make a strong guarantee that this will never ever happen.
It might be more useful to focus on finding a way that makes it as
rare as possible.
If problems like this are possible, but require a "perfect storm" of
buffer hits and misses occurring in precisely the same order, then maybe
it can't be too much of a problem in practice, since it shouldn't occur
again and again.
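For what it's worth, if we did move to an "average" distance per your
(3), I'd expect it to be nothing fancier than an exponential moving
average of the per-tuple distance, along these lines (illustrative only;
the 1/8 weight is arbitrary):

    /*
     * Smooth the observed tuple distance so that a single oddly-placed
     * buffer hit or miss can't flip the decision by itself.
     */
    static inline double
    update_avg_distance(double avg_distance, int current_distance)
    {
        /* give the newest observation a 1/8 weight */
        return avg_distance + (current_distance - avg_distance) / 8.0;
    }

That would damp a single local irregularity, though it doesn't change the
underlying trade-off.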
> > Much cleanup work remains to get the changes I just described in
> > proper shape (to say nothing about open items that we haven't made a
> > start on yet, like moving the read stream out of indexam.c and into
> > heapam). But it has been too long since the last revision. I'd like to
> > establish a regular cadence for posting new revisions of the patch
> > set.
> >
>
> Thank you! I appreciate the collaboration, it's a huge help.
I've enjoyed our collaboration. Feels like things are definitely
moving in the right direction. This is a challenging project.
> I kept running the stress test, trying to find cases that regress, and
> also to better understand the behavior. The script/charts are available
> here: https://github.com/tvondra/prefetch-tests
>
> So far I haven't found any massive regressions (relative to master).
> There are data sets where we regress by a couple percent (and it's not
> noise). I haven't looked into the details, but I believe most of this
> can be attributed to the "AIO costs" we discussed recently (with
> signals), and similar things.
The overall picture that these tests show is a positive one. I think
that this might actually be an acceptable performance profile, across
the board.
What's not acceptable is the code itself, and the uncertainty about how
fragile our current approach is. I hope that we can make it less fragile
in the coming weeks.
--
Peter Geoghegan