From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-09-15 19:28:37
Message-ID: 283fbf99-25ef-4e58-9ef4-08e92423e534@vondra.me
Lists: pgsql-hackers
On 9/15/25 17:12, Peter Geoghegan wrote:
> On Mon, Sep 15, 2025 at 9:00 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> Yeah, this heuristic seems very effective in eliminating the regression
>> (at least judging by the test results I've seen so far). Two or three
>> questions bother me about it, though:
>
> I more or less agree with all of your concerns about the
> INDEX_SCAN_MIN_TUPLE_DISTANCE optimization.
>
>> We're prefetching too close ahead, so close that the I/O can't possibly
>> complete, and the overhead of submitting the I/O using AIO is higher
>> than what async "saves".
>>
>> That's great, but is the distance a good measure of that?
>
> The big underlying problem is that the INDEX_SCAN_MIN_TUPLE_DISTANCE
> heuristic was developed in a way that overrelied on brute-force
> testing. It probably is still flawed in some specific way that cannot
> be justified, and that some query will actually run into.
>
> So at this point INDEX_SCAN_MIN_TUPLE_DISTANCE should still be
> considered nothing more than a promising experiment. It's promising
> because it does appear to work with a large variety of queries (we
> know of no query where it doesn't basically work right now). My hope
> is that we'll be able to come up with a simpler and more robust
> approach. One where we fully understand the downsides.
>
Agreed.
>> 2) It's a one-time decision, not adaptive. We start prefetching, and
>> then at some point (not too long after the scan starts) we make a
>> decision whether to continue with prefetching or not. And if we disable
>> it, it's disabled forever. That's fine for the data sets we use for
>> testing, because those are synthetic. I'm not sure it'll work as well
>> for real-world data sets, where different parts of the file may be
>> very different.
>
> We'll probably need to accept some hard trade-off in this area.
>
> In general (not just with this patch), prefetching works through trial
> and error -- the errors are useful information, and useful information
> isn't free. The regressions that the INDEX_SCAN_MIN_TUPLE_DISTANCE
> heuristic addresses are cases where the errors seem unlikely to pay
> for themselves. Let's not forget that these are not huge regressions
> -- it's not as if the patch ever does completely the wrong thing
> without INDEX_SCAN_MIN_TUPLE_DISTANCE. It's more like it hurts us to
> be constantly on the verge of doing the right thing, but never quite
> doing the right thing.
>
> Fundamentally, we need to be willing to pay for the cost of
> information through which we might be able to do better. We might be
> able to get the cost down, through some kind of targeted optimization,
> but it's unlikely to ever be close to free.
>
True. Useful information is not free, and we can construct "adversarial"
cases for any heuristic. But I'd like to be sure the hard trade-off
really is inevitable.
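
Just so we're on the same page about what I mean by "one-time decision",
this is roughly how I read the current behavior. It's a heavily
simplified standalone sketch, with made-up names and constants, not the
patch code:

    #include <stdbool.h>

    /*
     * Simplified sketch of the one-time decision (names and constants
     * are made up). After a fixed number of batches the "immediate"
     * distance is compared to the cutoff exactly once, and a "no"
     * disables prefetching for the remainder of the scan.
     */
    #define MIN_DISTANCE_NBATCHES   4
    #define MIN_TUPLE_DISTANCE      8

    typedef struct
    {
        int     nbatches;           /* batches processed so far */
        bool    prefetch_disabled;  /* sticky, never reset */
    } ScanPrefetchState;

    static bool
    continue_prefetching(ScanPrefetchState *state, int immediate_distance)
    {
        if (state->prefetch_disabled)
            return false;

        if (state->nbatches++ == MIN_DISTANCE_NBATCHES &&
            immediate_distance < MIN_TUPLE_DISTANCE)
            state->prefetch_disabled = true;    /* one-time, irreversible */

        return !state->prefetch_disabled;
    }

My points (2) and (3) are really about the "one-time" and "immediate
distance" aspects of that check.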
>> This is perfectly fine for a WIP patch, but I believe we should try to
>> make this adaptive. Which probably means we need to invent a "light"
>> version of read_stream that initially does sync I/O, and only switches
>> to async (with all the expensive initialization) later. And then can
>> switch back to sync, but is ready to maybe start prefetching again if
>> the data pattern changes.
>
> That does seem like it'd be ideal. But how are we supposed to decide
> to switch back?
>
> Right now, disabling prefetching disables the only way that we have to
> notice that prefetching might be useful (which is to notice that we're
> failing to keep up with our prefetch distance). Without
> INDEX_SCAN_MIN_TUPLE_DISTANCE, for those queries where prefetch
> distance collapses to ~2.0, we really can "decide to switch back to
> prefetching". But maintaining the option of switching back costs us
> too much (that's what we need INDEX_SCAN_MIN_TUPLE_DISTANCE to
> manage).
>
I imagined (with no code to support it yet) we'd do the sync I/O through
the read_stream. That way it'd know about the buffer hits and misses,
and could calculate the "distance" (even if the sync I/O doesn't use it).
Sure, it's not perfect, because "stream distance" is not the same as
"tuple distance". But we could calculate the "tuple distance" too, no?
In the "sync" mode the stream could also switch to non-AIO reads,
eliminating the signal bottleneck.
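
To sketch the kind of thing I have in mind (entirely hypothetical, none
of these names exist in read_stream, and the window size and thresholds
are made up), the stream could keep counting hits and misses even in the
cheap "sync" mode, and only pay for the AIO machinery when the misses
suggest it might help:

    #include <stdbool.h>

    /* Hypothetical sketch of an adaptive sync/async switch. */
    #define STREAM_WINDOW_SIZE      128
    #define SWITCH_TO_ASYNC_MISSES  16  /* many misses => prefetch */
    #define SWITCH_TO_SYNC_MISSES   2   /* almost all hits => stop */

    typedef struct
    {
        bool    async;      /* currently using AIO? */
        int     reads;      /* reads in the current window */
        int     misses;     /* cache misses in the window */
    } StreamState;

    static void
    stream_observe_read(StreamState *state, bool was_cache_miss)
    {
        state->reads++;
        if (was_cache_miss)
            state->misses++;

        /* re-evaluate the mode once per window, then start over */
        if (state->reads == STREAM_WINDOW_SIZE)
        {
            if (!state->async && state->misses >= SWITCH_TO_ASYNC_MISSES)
                state->async = true;    /* worth paying for AIO */
            else if (state->async && state->misses <= SWITCH_TO_SYNC_MISSES)
                state->async = false;   /* prefetching not helping */

            state->reads = 0;
            state->misses = 0;
        }
    }

The window size and thresholds are placeholders, of course. The point is
that the signal keeps being collected in both modes, so neither decision
is final.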
>> 3) Now that I look at the code in index_scan_stream_read_next, it feels
>> a bit weird that we make the decision based on the "immediate" distance
>> only. I suspect this may make it quite fragile, in the sense that even
>> a small local irregularity in the data may result in different "step"
>> changes. Wouldn't it be better to base this on some "average" distance?
>>
>> In other words, I'm afraid (2) and (3) are pretty much a "performance
>> cliff", where a tiny difference in the input can result in wildly
>> different behavior.
>
> You can say the same thing about hash join spilling. It might not be
> practical to make a strong guarantee that this will never ever happen.
> It might be more useful to focus on finding a way that makes it as
> rare as possible.
>
Sure, it applies to various places where we "flip" to a different
execution mode. All I'm saying is maybe we should try not to add more
such cases.
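
For (3) specifically, I was thinking of something as simple as an
exponentially weighted average of the per-batch distance, so a single
odd leaf page can't trigger the step change on its own. Again just a
sketch, with made-up names and a made-up smoothing factor:

    #include <stdbool.h>

    /*
     * Hypothetical sketch: smooth the per-batch tuple distance before
     * comparing it to the cutoff, instead of using the immediate value.
     * (Warm-up / initialization handling omitted for brevity.)
     */
    #define TUPLE_DISTANCE_ALPHA    0.2

    static double avg_tuple_distance = 0.0;

    static bool
    distance_below_cutoff(double batch_distance, double min_distance)
    {
        avg_tuple_distance =
            (1.0 - TUPLE_DISTANCE_ALPHA) * avg_tuple_distance +
            TUPLE_DISTANCE_ALPHA * batch_distance;

        return (avg_tuple_distance < min_distance);
    }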
> If problems like this are possible, but require a "perfect storm" of
> buffer hits and misses that occur in precisely the same order, then
> maybe it can't be too much of a problem in practice, since it
> shouldn't occur again and again.
>
I'm not sure it's such a "perfect storm", really. Imagine an index where
half the leaf pages are "nice" and get very high indexdiff values, while
the other half are "not nice" and get very low values. It's a matter of
random chance which leaf you happen to hit at
INDEX_SCAN_MIN_DISTANCE_NBATCHES.
>>> Much cleanup work remains to get the changes I just described in
>>> proper shape (to say nothing about open items that we haven't made a
>>> start on yet, like moving the read stream out of indexam.c and into
>>> heapam). But it has been too long since the last revision. I'd like to
>>> establish a regular cadence for posting new revisions of the patch
>>> set.
>>>
>>
>> Thank you! I appreciate the collaboration, it's a huge help.
>
> I've enjoyed our collaboration. Feels like things are definitely
> moving in the right direction. This is definitely a challenging
> project.
>
>> I kept running the stress test, trying to find cases that regress, and
>> also to better understand the behavior. The script/charts are available
>> here: https://github.com/tvondra/prefetch-tests
>>
>> So far I haven't found any massive regressions (relative to master).
>> There are data sets where we regress by a couple percent (and it's not
>> noise). I haven't looked into the details, but I believe most of this
>> can be attributed to the "AIO costs" we discussed recently (with
>> signals), and similar things.
>
> The overall picture that these tests show is a positive one. I think
> that this might actually be an acceptable performance profile, across
> the board.
>
Perhaps.
> What's not acceptable is the code itself, and the current uncertainty
> about how fragile our current approach is. I hope that we can make it
> less fragile in the coming weeks.
>
Agreed.
regards
--
Tomas Vondra