From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-08-19 18:22:00
Message-ID: CAH2-WzkEmu9+sn+M-6fnxUYE3VOP1J3a47M81asRA4FyKTWC0g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 19, 2025 at 1:23 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> Thanks for investigating this. I think it's the right direction - simple
> OLTP queries should not be paying for building read_stream when there's
> little chance of benefit.
>
> Unfortunately, this seems to be causing regressions, both compared to
> master (or disabled prefetching), and to the earlier prefetch patches.
> The main difference in the explains is this:
>
> Prefetch Distance: 2.066 (new patch)
>
> Prefetch Distance: 87.970 (old patch, without priorbatch)
>
> The histogram just confirms this, with most prefetches either in [2,4)
> or [64,128) bins. The new patch has much lower prefetch distance.
That definitely seems like a problem. I think that you're saying that
this problem happens because we have extra buffer hits earlier on,
which is enough to completely change the ramp-up behavior. This seems
to be all it takes to dramatically decrease the effectiveness of
prefetching. Does that summary sound correct?
> I believe this is the same issue with "collapsed" distance after
> resetting the read_stream. In that case the trouble was the reset also
> set distance to 1, and there were so many "hits" due to buffers read
> earlier it never ramped up again (we doubled it every now and then, but
> the decay was faster).
If my summary of what you said is accurate, then to me the obvious
question is: isn't this also going to be a problem *without* the new
"delay starting read stream" behavior? Couldn't you break the "removed
priorbatch" case in about the same way using a slightly different test
case? Say a test case involving concurrent query execution?
More concretely: what about similar cases where some *other* much more
selective query runs around the same time as the nonselective
regressed query? What if this other selective query reads the same
group of heap pages into shared_buffers that our nonselective query
will also need to visit (before visiting all the other heap pages not
yet in shared_buffers, that we want to prefetch)? Won't this other
scenario also confuse the read stream ramp-up heuristics, in a similar
way?
It seems bad that the initial conditions that the read stream sees can
have such lasting consequences. It feels as if the read stream is
chasing its own tail. I wonder if this is related to the fact that
we're using the read stream in a way that it wasn't initially
optimized for. After all, we're the first caller that doesn't just do
sequential access all the time -- we're bound to have novel problems
with the read stream for that reason alone.
> The same thing happens here, I think. We process the first batch without
> using a read stream. Then after reading the second batch we create the
> read_stream, but it starts with distance=1 - it's just like after reset.
> And it never ramps up the distance, because of the hits from reading the
> preceding batch.
> Maybe it'd be possible to track some stats, during the initial phase,
> and then use that to initialize the distance for the first batch
> processed by read stream? Seems rather inconvenient, though.
But why should the stats from the first leaf page read be particularly
important? It's just one page out of the thousands that are ultimately
read. Unless I've misunderstood you, the real problem seems to be that
the read stream effectively gets fixated on a few early buffer hits.
It sounds like it is getting stuck in a local minimum, or something
like that.
> What exactly is the overhead of creating the read_stream? Is that about
> allocating memory, or something else?
It's hard to be precise here, because we're only talking about a 3%
regression with pgbench. A lot of that regression is probably related
to memory allocation overhead. I also remember get_tablespace() being
visible in profiles (it is called from
get_tablespace_maintenance_io_concurrency, which is itself called from
read_stream_begin_impl). It's probably a lot of tiny things that all
add up to a small (though still unacceptable) regression.
> Would it be possible to reduce the
> overhead enough to not matter even for OLTP queries?
> Maybe it would be
> possible to initialize the read_stream only "partially", enough to do
> sync I/O and track the distance, and delay only the expensive stuff?
Maybe, but I think that this is something to consider only after other
approaches to fixing the problem fail.
> I'm also not sure it's optimal to only initialize read_stream after
> reading the next batch. For some indexes a batch can have hundreds of
> items, and that certainly could benefit from prefetching.
That does seem quite possible, and should also be investigated. But it
doesn't sound like the issue you're seeing with your adversarial
random query.
> I suppose it
> should be possible to initialize the read_stream half-way though a
> batch, right? Or is there a reason why that can't work?
Yes, that's right -- the general structure should be able to support
switching over to a read stream when we're only mid-way through
reading the TIDs associated with a given batch (likely the first
batch). The only downside is that it'd require adding more
logic/branches to heapam_index_fetch_tuple to detect when to do this. I
think that that approach is workable, if we really need it to work --
it's definitely an option.
For now I would like to focus on debugging your problematic query
(which doesn't sound like the kind of query that could benefit from
initializing the read_stream when we're still only half-way through a
batch). Does that make sense, do you think?
--
Peter Geoghegan