Re: index prefetching

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Alexandre Felipe <o(dot)alexandre(dot)felipe(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2026-02-17 17:16:23
Message-ID: CAH2-Wzk-89uCvdJ1Q6NsM6LvDvUEt6Qy66T6A60J=D_voWxZDg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Feb 16, 2026 at 11:48 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> Yes, It's hurting quite substantially. For well correlated index scans it
> prevents readahead from becoming aggressive enough even on a local low latency
> SSD. Which means it'll not even be even remotely aggressive enough on a
> networked block store.

I agree that the current heuristics (which were invented recently) are
too conservative. I overfit the heuristics to my current set of
adversarial queries, as a stopgap measure.

> Note that there is pretty much *no* readhead, because the yields happen more
> frequently than a io_combine_limit sized IO can be formed.

ISTM that we need the yields to better cooperate with whatever's
happening on the read stream side.

> With the yielding logic disabled:

> The comment seems to say it's about avoiding to look very into the future when
> using index only scans that just need a few heap lookups. Certainly an
> important goal.

The main motivation for yielding is to deal with things like merge
joins fed by at least one plain index scan, and plain scans for an
"ORDER BY ... LIMIT N" query. The LIMIT N case is also addressed by the
"Use ExecSetTupleBound hint during index scans" patch, albeit
imperfectly. This is also important with index-only scans, though only
when heap fetches are actually required (most index-only scans will
never create a read stream in the first place).

I attach an example of where disabling the yield mechanism hurts
instead of helping, to give you a sense of the problems in this area.
Notice that the inner side of the merge join ("Index Scan using
prefetch_customers_pkey on prefetch_customers c") requires the same
number of buffer hits + buffer reads as master with yielding enabled
-- that allows the patch to shave off 7% of master's execution time.
Whereas without yielding there are quite a lot more buffer hits (for
index page reads) and buffer misses (for heap page reads) -- which
hurts us: the patch then takes about 13% longer to execute the query.

Just to be clear, I'm not arguing that this is the right trade-off.
And I understand that it's just not feasible to completely prevent
such plan shapes from reading more data than strictly necessary -- a
certain amount of that seems like a "cost of doing business". But it
seems important that the added cost of speculatively reading later
batches/index leaf pages never exceed some fixed threshold.

The non-yielding EXPLAIN ANALYZE has an inner-side index scan that does
~60% more work than master (and than the yielding variant of the patch).
It's not so much that the amount of extra work we'll perform is
excessive (though it is); what really concerns me is that the amount
of extra work is *indeterminate*. There's likely to be other queries
that are much slower still, at least in the absence of some kind of
yielding mechanism. And so ISTM that we need a better yield mechanism
-- one that is better attuned to the need to maintain an adequate
prefetch distance on the read stream side.

> One thing that does confuse me about the yielding logic is that it seems to
> actually put a cap on ever looking more than two batches ahead (with a bit of
> fuzziness)? Why support more batches then (INDEX_SCAN_MAX_BATCHES == 128)?
> Isn't two batches too low for anything with some correlation on even remotely
> higher latency storage?

FWIW that's not actually true; we *can* still use more than 2 batches.
My instrumentation confirms this. Though it seems to top out at about
4 batches total with the current yield logic and my test suite
queries.

--
Peter Geoghegan

Attachment Content-Type Size
plans-yield-noyield.txt text/plain 4.5 KB
