Re: index prefetching

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>
Subject: Re: index prefetching
Date: 2023-06-08 22:17:36
Message-ID: 5dbcbdf1-fd9f-89f6-eab4-bd3c33cf53a4@enterprisedb.com
Lists: pgsql-hackers

On 6/8/23 20:56, Peter Geoghegan wrote:
> On Thu, Jun 8, 2023 at 8:40 AM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>> We already do prefetching for bitmap index scans, where the bitmap heap
>> scan prefetches future pages based on effective_io_concurrency. I'm not
>> sure why exactly was prefetching implemented only for bitmap scans, but
>> I suspect the reasoning was that it only helps when there's many
>> matching tuples, and that's what bitmap index scans are for. So it was
>> not worth the implementation effort.
>
> I have an educated guess as to why prefetching was limited to bitmap
> index scans this whole time: it might have been due to issues with
> ScalarArrayOpExpr quals.
>
> Commit 9e8da0f757 taught nbtree to deal with ScalarArrayOpExpr quals
> "natively". This meant that "indexedcol op ANY(ARRAY[...])" conditions
> were supported by both index scans and index-only scans -- not just
> bitmap scans, which could handle ScalarArrayOpExpr quals even without
> nbtree directly understanding them. The commit was in late 2011,
> shortly after the introduction of index-only scans -- which seems to
> have been the real motivation. And so it seems to me that support for
> ScalarArrayOpExpr was built with bitmap scans and index-only scans in
> mind. Plain index scan ScalarArrayOpExpr quals do work, but support
> for them seems kinda perfunctory to me (maybe you can think of a
> specific counter-example where plain index scans really benefit from
> ScalarArrayOpExpr, but that doesn't seem particularly relevant to the
> original motivation).
>
I don't think SAOP is the reason. I did a bit of digging in the list
archives, and found thread [1], which says:

Regardless of what mechanism is used and who is responsible for
doing it someone is going to have to figure out which blocks are
specifically interesting to prefetch. Bitmap index scans happen
to be the easiest since we've already built up a list of blocks
we plan to read. Somehow that information has to be pushed to the
storage manager to be acted upon.

Normal index scans are an even more interesting case but I'm not
sure how hard it would be to get that information. It may only be
convenient to get the blocks from the last leaf page we looked at,
for example.

So this suggests we simply started prefetching in the case where the
information was readily available, and since it'd be harder to do for
index scans, that part simply never happened.

There are a couple more ~2008 threads mentioning prefetching, bitmap
scans and even regular index scans (like [2]). None of them mentions the
SAOP stuff at all.

[1]
https://www.postgresql.org/message-id/871wa17vxb.fsf%40oxford.xeocode.com

[2]
https://www.postgresql.org/message-id/87wsnnz046.fsf%40oxford.xeocode.com

> ScalarArrayOpExpr for plain index scans don't really make that much
> sense right now because there is no heap prefetching in the index scan
> case, which is almost certainly going to be the major bottleneck
> there. At the same time, adding useful prefetching for
> ScalarArrayOpExpr execution more or less requires that you first
> improve how nbtree executes ScalarArrayOpExpr quals in general. Bear
> in mind that ScalarArrayOpExpr execution (whether for bitmap index
> scans or index scans) is related to skip scan/MDAM techniques -- so
> there are tricky dependencies that need to be considered together.
>
> Right now, nbtree ScalarArrayOpExpr execution must call _bt_first() to
> descend the B-Tree for each array constant -- even though in principle
> we could avoid all that work in cases that happen to have locality. In
> other words we'll often descend the tree multiple times and land on
> exactly the same leaf page again and again, without ever noticing that
> we could have gotten away with only descending the tree once (it'd
> also be possible to start the next "descent" one level up, not at the
> root, intelligently reusing some of the work from an initial descent
> -- but you don't need anything so fancy to greatly improve matters
> here).
>
> This lack of smarts around how many times we call _bt_first() to
> descend the index is merely a silly annoyance when it happens in
> btgetbitmap(). We do at least sort and deduplicate the array up-front
> (inside _bt_sort_array_elements()), so there will be significant
> locality of access each time we needlessly descend the tree.
> Importantly, there is no prefetching "pipeline" to mess up in the
> bitmap index scan case -- since that all happens later on. Not so for
> the superficially similar (though actually rather different) plain
> index scan case -- at least not once you add prefetching. If you're
> uselessly processing the same leaf page multiple times, then there is
> no way that heap prefetching can notice that it should be batching
> things up. The context that would allow prefetching to work well isn't
> really available right now. So the plain index scan case is kinda at a
> gratuitous disadvantage (with prefetching) relative to the bitmap
> index scan case.
>
> Queries with (say) quals with many constants appearing in an "IN()"
> are both common and particularly likely to benefit from prefetching.
> I'm not suggesting that you need to address this to get to a
> committable patch. But you should definitely think about it now. I'm
> strongly considering working on this problem for 17 anyway, so we may
> end up collaborating on these aspects of prefetching. Smarter
> ScalarArrayOpExpr execution for index scans is likely to be quite
> compelling if it enables heap prefetching.
>
Even if SAOP (probably) wasn't the reason, I think you're right that it
may be an issue for prefetching, causing regressions. It didn't occur to
me before, because I'm not that familiar with the btree code or how it
deals with SAOP (and I didn't really intend to study it too deeply).

So if you're planning to work on this for PG17, collaborating on it
would be great.

For now I plan to just ignore SAOP, or maybe just disable prefetching
for SAOP index scans if it proves prone to regressions. That's not
great, but at least it won't make matters worse.

>> But there's three shortcomings in logic:
>>
>> 1) It's not clear the thresholds for prefetching being beneficial and
>> switching to bitmap index scans are the same value. And as I'll
>> demonstrate later, the prefetching threshold is indeed much lower
>> (perhaps a couple dozen matching tuples) on large tables.
>
> As I mentioned during the pgCon unconference session, I really like
> your framing of the problem; it makes a lot of sense to directly
> compare an index scan's execution against a very similar bitmap index
> scan execution -- there is an imaginary continuum between index scan
> and bitmap index scan. If the details of when and how we scan the
> index are rather similar in each case, then there is really no reason
> why the performance shouldn't be fairly similar. I suspect that it
> will be useful to ask the same question for various specific cases,
> that you might not have thought about just yet. Things like
> ScalarArrayOpExpr queries, where bitmap index scans might look like
> they have a natural advantage due to an inherent need for random heap
> access in the plain index scan case.
>

Yeah, although all the tests were done with a random table generated
like this:

insert into btree_test select $d * random(), md5(i::text)
from generate_series(1, $ROWS) s(i)

So it's damn random anyway. Then again, it's just as random for the
bitmap case, so maybe if the SAOP values had some sort of locality,
that'd be an advantage for the bitmap scan. But what would such a table
look like?

I guess something like this might be a "nice" bad case:

insert into btree_test select mod(i,100000), md5(i::text)
from generate_series(1, $ROWS) s(i)

select * from btree_test where a in (999, 1000, 1001, 1002)

The values are likely colocated on the same heap page, so the bitmap
scan is going to do a single prefetch per page, while the index scan
will prefetch the same pages repeatedly. I'll give it a try.
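
To be concrete, the whole experiment might look something like this
(just a sketch; the table definition, row count and exact values are
made up):

create table btree_test (a int, b text);

insert into btree_test
select mod(i, 100000), md5(i::text)
  from generate_series(1, 10000000) s(i);

create index on btree_test (a);
vacuum analyze btree_test;

-- rows matching the four values sit on the same heap pages, so the
-- bitmap scan prefetches each page once, while the plain index scan
-- revisits (and would prefetch) the same pages once per array value
explain (analyze, buffers)
select * from btree_test where a in (999, 1000, 1001, 1002);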

> It's important to carefully distinguish between cases where plain
> index scans really are at an inherent disadvantage relative to bitmap
> index scans (because there really is no getting around the need to
> access the same heap page many times with an index scan) versus cases
> that merely *appear* that way. Implementation restrictions that only
> really affect the plain index scan case (e.g., the lack of a
> reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
> should be accounted for when assessing the viability of index scan +
> prefetch over bitmap index scan + prefetch. This is very subtle, but
> important.
>

I do agree, but what do you mean by "assessing"? Wasn't the agreement
at the unconference session that we'd not tweak the costing? So
ultimately, this does not really affect which scan type we pick. We'll
keep making the same planning decisions as today, no?

If we pick an index scan and enable prefetching, causing a regression
(e.g. for the SAOP case with locality), that'd be bad. But how is that
related to the viability of index scans over bitmap index scans?

> That's what I was mostly trying to get at when I talked about testing
> strategy at the unconference session (this may have been unclear at
> the time). It could be done in a way that helps you to think about the
> problem from first principles. It could be really useful as a way of
> avoiding confusing cases where plain index scan + prefetch does badly
> due to implementation restrictions, versus cases where it's
> *inherently* the wrong strategy. And a testing strategy that starts
> with very basic ideas about what I/O is truly necessary might help you
> to notice and fix regressions. The difference will never be perfectly
> crisp, of course (isn't bitmap index scan basically just index scan
> with a really huge prefetch buffer anyway?), but it still seems like a
> useful direction to go in.
>

I'm all for building a more comprehensive set of test cases - the stuff
presented at pgcon was good for demonstration, but it certainly is not
enough for testing. The SAOP queries are a great addition, I also plan
to run those queries on different (less random) data sets, etc. We'll
probably discover more interesting cases as the patch improves.
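
As an example of a less random data set, something like this (again
just a sketch, with made-up names and sizes) would make the index
perfectly correlated with the heap order:

create table btree_test_corr (a int, b text);

insert into btree_test_corr
select i / 100, md5(i::text)
  from generate_series(1, 10000000) s(i);

create index on btree_test_corr (a);
vacuum analyze btree_test_corr;

-- matching rows for each value are packed on a few adjacent heap pages,
-- which should be close to the best case for both scan types
select * from btree_test_corr where a in (10, 20, 30, 40);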

>> Implementation
>> --------------
>>
>> When I started looking at this, I only really thought about btree. If
>> you look at BTScanPosData, which is what the index scans use to
>> represent the current leaf page, you'll notice it has "items", which is
>> the array of item pointers (TIDs) that we'll fetch from the heap. Which
>> is exactly the thing we need.
>
>> So I ended up moving most of the prefetching logic up into indexam.c,
>> see the index_prefetch() function. It can't be entirely separate,
>> because each AM represents the current state in a different way (e.g.
>> SpGistScanOpaque and BTScanOpaque are very different).
>
> Maybe you were right to do that, but I'm not entirely sure.
>
> Bear in mind that the ScalarArrayOpExpr case already looks like a
> single index scan whose qual involves an array to the executor, even
> though nbtree more or less implements it as multiple index scans with
> plain constant quals (one per unique-ified array element). Index scans
> whose results can be "OR'd together". Is that a modularity violation?
> And if so, why? As I've pointed out earlier in this email, we don't do
> very much with that context right now -- but clearly we should.
>
> In other words, maybe you're right to suspect that doing this in AMs
> like nbtree is a modularity violation. OTOH, maybe it'll turn out that
> that's exactly the right place to do it, because that's the only way
> to make the full context available in one place. I myself struggled
> with this when I reviewed the skip scan patch. I was sure that Tom
> wouldn't like the way that the skip-scan patch doubles-down on adding
> more intelligence/planning around how to execute queries with
> skippable leading columns. But, it turned out that he saw the merit in
> it, and basically accepted that general approach. Maybe this will turn
> out to be a little like that situation, where (counter to intuition)
> what you really need to do is add a new "layering violation".
> Sometimes that's the only thing that'll allow the information to flow
> to the right place. It's tricky.
>

There are two reasons why I think the AM is not the right place:

- accessing the table from index code seems backwards

- we already do prefetching from the executor (nodeBitmapHeapscan.c)

So doing it in the AM feels kinda wrong in hindsight.

>> 4) per-leaf prefetching
>>
>> The code is restricted only prefetches items from one leaf page. If the
>> index scan needs to scan multiple (many) leaf pages, we have to process
>> the first leaf page first before reading / prefetching the next one.
>>
>> I think this is acceptable limitation, certainly for v0. Prefetching
>> across multiple leaf pages seems way more complex (particularly for the
>> cases using pairing heap), so let's leave this for the future.
>
> I tend to agree that this sort of thing doesn't need to happen in the
> first committed version. But FWIW nbtree could be taught to scan
> multiple index pages and act as if it had just processed them as one
> single index page -- up to a point. This is at least possible with
> plain index scans that use MVCC snapshots (though not index-only
> scans), since we already drop the pin on the leaf page there anyway.
> AFAICT nothing stops us from teaching nbtree to "lie" to the executor and tell
> it that we processed 1 leaf page, even though it was actually 5 leaf pages
> (maybe there would also have to be restrictions for the markpos stuff).
>

Yeah, I'm not saying it's impossible, and imagined we might teach nbtree
to do that. But it seems like work for future someone.

>> the results look a bit different:
>>
>> rows bitmapscan master patched seqscan
>> 1 52703.9 19.5 19.5 31145.6
>> 10 51208.1 22.7 24.7 30983.5
>> 100 49038.6 39.0 26.3 32085.3
>> 1000 53760.4 193.9 48.4 31479.4
>> 10000 56898.4 1600.7 187.5 32064.5
>> 100000 50975.2 15978.7 1848.9 31587.1
>>
>> This is a good illustration of a query where bitmapscan is terrible
>> (much worse than seqscan, in fact), and the patch is a massive
>> improvement over master (about an order of magnitude).
>>
>> Of course, if you only scan a couple rows, the benefits are much more
>> modest (say 40% for 100 rows, which is still significant).
>
> Nice! And, it'll be nice to be able to use the kill_prior_tuple
> optimization in many more cases (possible by teaching the optimizer to
> favor index scans over bitmap index scans more often).
>

Right, I forgot to mention that benefit. Although, that'd only happen if
we actually choose index scans in more places, which I guess would
require tweaking the costing model ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
