| From: | Konstantin Knizhnik <knizhnik(at)garret(dot)ru> |
|---|---|
| To: | Tomas Vondra <tomas(at)vondra(dot)me>, Peter Geoghegan <pg(at)bowt(dot)ie> |
| Cc: | Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
| Subject: | Re: index prefetching |
| Date: | 2026-01-05 12:21:31 |
| Message-ID: | dcbc802c-97c8-48ea-a40f-358482b9def6@garret.ru |
| Lists: | pgsql-hackers |
On 05/01/2026 2:21 AM, Tomas Vondra wrote:
> Even a small fixed overhead can be quite visible for extremely short
> queries that can't possibly benefit from prefetching (like the LIMIT 1).
>
> That's what we were concerned about when we invented this heuristic
> (which only prefetches from the 2nd batch). I agree it's crude, no
> argument there. I'm not against having something better, as long as we
> can make it reliable.
I completely agree that we should not create the read stream from the
very beginning, despite the small overhead.
My concern is that "second batch" is not a good criterion: depending on
key size and the position of the first key on the page, the first batch
can be too short. For example, with wide keys a leaf page may hold only
a few dozen TIDs, and if the scan starts near the end of its first leaf,
the first "batch" may contain just a handful of items.
> One of the problems with read_stream is that it may look very far
> ahead - much further than the query will need. Imagine an index-only
> scan, with 99.99% pages being all-visible. At some point the scan will
> hit an item that requires reading the heap page. Which triggers the
> look-ahead, and a search for the *next* heap page.
>
> But that next index entry may be many leaves ahead. And maybe the scan
> never even gets to actually need that, which means the work will be
> wasted. If the query only needs a couple tuples, this may cause a bad
> regression.
Sorry, I do not completely understand how this can happen.
The read stream can only fetch heap pages that are referenced by TIDs in
leaf pages (leaf = batch).
So the read stream cannot advance beyond the TIDs of the currently
processed leaf page, can it?
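Just to make sure we are talking about the same thing, this is roughly
how I picture the block-number callback behaving - a sketch only, with
made-up names (BatchState etc.), not the actual patch code:

```c
#include "storage/itemptr.h"        /* ItemPointerGetBlockNumber */
#include "storage/read_stream.h"    /* ReadStream, callback signature */

/* Hypothetical per-batch state: the TIDs collected from one leaf page. */
typedef struct BatchState
{
    ItemPointerData *tids;      /* heap TIDs from the current leaf */
    int              nitems;
    int              next_item;
} BatchState;

static BlockNumber
batch_next_block(ReadStream *stream, void *callback_private_data,
                 void *per_buffer_data)
{
    BatchState *batch = (BatchState *) callback_private_data;

    /*
     * Once the TIDs of the current leaf are exhausted, the callback
     * returns InvalidBlockNumber, so the stream's look-ahead cannot
     * run past the batch boundary.
     */
    if (batch->next_item >= batch->nitems)
        return InvalidBlockNumber;

    return ItemPointerGetBlockNumber(&batch->tids[batch->next_item++]);
}
```

If the real callback works like this, the look-ahead distance is
naturally bounded by the size of the current batch.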
> IIRC the same thing can happen for plain index scans. You may need to
> try with correlated indexes (I think the read_stream callback skips
> duplicate TIDs).
Another problem is that different TIDs can refer to the same heap page
(which happens with a clustered index or a table populated in
key-ascending order).
But the "number of SMGR reads" criterion should also work well in this
case.
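This is also why skipping duplicates in the callback is cheap.
Continuing the sketch above (still made-up names, and assuming
BatchState additionally carries a last_block field initialized to
InvalidBlockNumber):

```c
/*
 * Variant of the sketch above: skip runs of TIDs that point to the
 * heap page we just handed to the stream.  This covers the clustered /
 * key-ascending case, where duplicates come in consecutive runs.
 */
static BlockNumber
batch_next_block_dedup(ReadStream *stream, void *callback_private_data,
                       void *per_buffer_data)
{
    BatchState *batch = (BatchState *) callback_private_data;

    while (batch->next_item < batch->nitems)
    {
        BlockNumber blkno =
            ItemPointerGetBlockNumber(&batch->tids[batch->next_item++]);

        if (blkno == batch->last_block)
            continue;           /* same page as previous TID - skip */

        batch->last_block = blkno;
        return blkno;
    }

    return InvalidBlockNumber;
}
```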
>> When all data is cached in shared buffers, then we do not perform IO
>> at all. It means it doesn't matter whether and when we initialize the
>> read_stream.
>> We can do it after processing 10 items (current default), or
>> immediately - it should not affect performance.
>> And this is what I have tested: performance actually does not depend
>> on `read_stream_threshold` (if data fits in shared buffers).
>> At least it is within a few percent, and maybe it is just random
>> fluctuations.
>> Obviously there is no 25% degradation.
>
> I'm not sure it's this simple. Even if we perform no physical I/O, the
> read_stream callback will need to go through the whole batch and pass
> all the items to the read_stream. The data may be cached, but not in
> shared buffers, in which case the read_stream will do the IPC etc.
You mean that the data is present in the OS file cache, so reading it is
fast and will not benefit from AIO?
It certainly can happen, but that seems to be just the well-known
problem of double buffering.
>> It definitely doesn't mean that it is not possible to find a scenario
>> where this approach (enabling prefetch after processing N items) will
>> show worse performance than master or v5. We just need to properly
>> choose the cache hit rate. But the same is true IMHO for v5 itself:
>> it is possible to find a workload where it will show the same
>> degradation compared with master.
>
> I'm sure such "adversarial" data sets exist, even if we don't have a
> good example at hand. The question is how plausible / realistic such
> data sets are, though.
>
> I'm not sure about v5, though. The assumption was the overhead is most
> visible for short queries, and once we get to the 2nd batch the query
> is expensive enough to not be affected too much.
>
> When you say "we just need to properly choose cache hit rate", how would
> we do that? And how would we know/determine the optimal threshold?
Sorry, maybe I was unclear.
By "choosing cache hit rate" I did not mean using it to determine the
proper threshold.
I was thinking about the best and worst scenarios for index prefetch.
Originally, in my "benchmark" I considered the case when no heap pages
are present in shared buffers.
In your scenario the size of shared buffers is larger than the size of
the table, so all pages are present in shared buffers.
The first case is certainly the best scenario for index prefetch, where
we can expect the largest improvement.
In your case there is certainly no improvement, but as we can see, the
overhead is also not that large (just because we do not read any pages).
I expect the worst result in some intermediate case, when some pages
are present in shared buffers and some are not.
This will be especially true for the "second batch" criterion, because
it can happen that we need to read just a single heap page for the
whole batch, and using a read stream for it will just add extra
overhead.
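Such a batch could even be detected cheaply before building the stream.
A sketch, reusing the hypothetical BatchState from above (it counts
runs, which equals the number of distinct pages when TIDs are in heap
order, i.e. exactly the clustered case):

```c
/*
 * Sketch: count how many distinct heap pages one batch touches.
 * Assumes the TIDs are in heap order, so duplicates form consecutive
 * runs.  If the answer is 1, a plain ReadBuffer() is cheaper than
 * setting up a read stream for the batch.
 */
static int
batch_distinct_blocks(const BatchState *batch)
{
    int         ndistinct = 0;
    BlockNumber last = InvalidBlockNumber;

    for (int i = 0; i < batch->nitems; i++)
    {
        BlockNumber blkno = ItemPointerGetBlockNumber(&batch->tids[i]);

        if (blkno != last)
        {
            ndistinct++;
            last = blkno;
        }
    }

    return ndistinct;
}
```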
>> A more precise heuristic should IMHO take into account the actual
>> number of performed disk reads.
> I'm not sure "disk reads" is the right term. The data may be in page
> cache, and we just need to copy them into shared buffers.
I agree, but there seems to be no way to determine whether a page was
actually read from disk or from the OS cache.
But as I wrote above, this is the double-buffering problem again. If we
can avoid it (i.e. by using direct IO), then it goes away.
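To make the criterion concrete, I imagine something along these lines
(a sketch only: PrefetchDecision and the threshold are made-up names;
pgBufferUsage.shared_blks_read is the existing counter of reads into
shared buffers, i.e. SMGR reads):

```c
#include "executor/instrument.h"    /* pgBufferUsage */

/* Hypothetical per-scan state: buffer-read counter at scan start. */
typedef struct PrefetchDecision
{
    int64   blks_read_at_start;
} PrefetchDecision;

/*
 * Enable the read stream only once the scan itself has caused some
 * real reads into shared buffers (SMGR reads), instead of counting
 * processed items or batches.
 */
static bool
should_enable_read_stream(PrefetchDecision *d, int threshold)
{
    return pgBufferUsage.shared_blks_read - d->blks_read_at_start
           >= threshold;
}
```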
>> Please notice that I do not want to predict the number of disk reads
>> - i.e. to check whether candidates for prefetch are present in shared
>> buffers. That would really add significant overhead. I think it is
>> better to use the number of actually performed reads as the threshold.
>>
> Not sure I understand what this says. Can you elaborate?
I sent the proposed patch in the previous mail.
Did you have a chance to look at it?