Re: index prefetching

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>
Subject: Re: index prefetching
Date: 2023-06-09 10:44:46
Message-ID: 970c1c5c-8af4-e2ac-b985-e66331e39701@enterprisedb.com
Lists: pgsql-hackers

On 6/9/23 01:38, Peter Geoghegan wrote:
> On Thu, Jun 8, 2023 at 3:17 PM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>> Normal index scans are an even more interesting case but I'm not
>> sure how hard it would be to get that information. It may only be
>> convenient to get the blocks from the last leaf page we looked at,
>> for example.
>>
>> So this suggests we simply started prefetching for the case where the
>> information was readily available, and it'd be harder to do for index
>> scans so that's it.
>
> What the exact historical timeline is may not be that important. My
> emphasis on ScalarArrayOpExpr is partly due to it being a particularly
> compelling case for both parallel index scan and prefetching, in
> general. There are many queries that have huge in() lists that
> naturally benefit a great deal from prefetching. Plus they're common.
>

Did you mean parallel index scan or bitmap index scan?

But yeah, I get the point that SAOP queries are an interesting example
of queries to explore. I'll add some to the next round of tests.
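
For example, I'm thinking of something along these lines (the table,
the data distribution and the size of the IN() list are all just
placeholders for illustration):

  create table saop_test (a int, b text);

  insert into saop_test select mod(i, 100000), md5(i::text)
    from generate_series(1, 1000000) s(i);

  create index on saop_test (a);

  -- a SAOP query with a larger, scattered IN() list
  select * from saop_test
   where a in (1, 42, 1000, 4999, 20000, 33333, 71234, 99999);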

>> Even if SAOP (probably) wasn't the reason, I think you're right it may
>> be an issue for prefetching, causing regressions. It didn't occur to me
>> before, because I'm not that familiar with the btree code and/or how it
>> deals with SAOP (and didn't really intend to study it too deeply).
>
> I'm pretty sure that you understand this already, but just in case:
> ScalarArrayOpExpr doesn't even "get the blocks from the last leaf
> page" in many important cases. Not really -- not in the sense that
> you'd hope and expect. We're senselessly processing the same index
> leaf page multiple times and treating it as a different, independent
> leaf page. That makes heap prefetching of the kind you're working on
> utterly hopeless, since it effectively throws away lots of useful
> context. Obviously that's the fault of nbtree ScalarArrayOpExpr
> handling, not the fault of your patch.
>

I think I understand, although maybe my mental model is wrong. I agree
it seems inefficient, but I'm not sure why it would make prefetching
hopeless. Sure, it puts index scans at a disadvantage (compared to
bitmap scans), but if we pick an index scan it should still be an
improvement, right?

I guess I need to do some testing on a range of data sets / queries, and
see how it works in practice.

>> So if you're planning to work on this for PG17, collaborating on it
>> would be great.
>>
>> For now I plan to just ignore SAOP, or maybe just disable prefetching
>> for SAOP index scans if it proves to be prone to regressions. That's not
>> great, but at least it won't make matters worse.
>
> Makes sense, but I hope that it won't come to that.
>
> IMV it's actually quite reasonable that you didn't expect to have to
> think about ScalarArrayOpExpr at all -- it would make a lot of sense
> if that was already true. But the fact is that it works in a way
> that's pretty silly and naive right now, which will impact
> prefetching. I wasn't really thinking about regressions, though. I was
> actually more concerned about missing opportunities to get the most
> out of prefetching. ScalarArrayOpExpr really matters here.
>

OK

>> I guess something like this might be a "nice" bad case:
>>
>> insert into btree_test select mod(i,100000), md5(i::text)
>> from generate_series(1, $ROWS) s(i)
>>
>> select * from btree_test where a in (999, 1000, 1001, 1002)
>>
>> The values are likely colocated on the same heap page, so the bitmap
>> scan is going to do a single prefetch. With an index scan we'll prefetch
>> them repeatedly. I'll give it a try.
>
> This is the sort of thing that I was thinking of. What are the
> conditions under which bitmap index scan starts to make sense? Why is
> the break-even point whatever it is in each case, roughly? And, is it
> actually because of a laws-of-physics level trade-off? Might it not be
> due to implementation-level issues that are much less fundamental? In
> other words, might it actually be that we're just doing something
> stoopid in the case of plain index scans? Something that is just
> papered-over by bitmap index scans right now?
>

Yeah, that's partially why I do this kind of testing on a wide range of
synthetic data sets - to find cases that behave in unexpected ways (say,
seem like they should improve but don't).

> I see that your patch has logic that avoids repeated prefetching of
> the same block -- plus you have comments that wonder about going
> further by adding a "small lru array" in your new index_prefetch()
> function. I asked you about this during the unconference presentation.
> But I think that my understanding of the situation was slightly
> different to yours. That's relevant here.
>
> I wonder if you should go further than this, by actually sorting the
> items that you need to fetch as part of processing a given leaf page
> (I said this at the unconference, you may recall). Why should we
> *ever* pin/access the same heap page more than once per leaf page
> processed per index scan? Nothing stops us from returning the tuples
> to the executor in the original logical/index-wise order, despite
> having actually accessed each leaf page's pointed-to heap pages
> slightly out of order (with the aim of avoiding extra pin/unpin
> traffic that isn't truly necessary). We can sort the heap TIDs in
> scratch memory, then do our actual prefetching + heap access, and then
> restore the original order before returning anything.
>

I think that's possible, and I thought about that a bit (not just for
btree, but especially for the distance queries on GiST). But I don't
have a good idea whether this would be a 1% or a 50% improvement, and I was
concerned it might easily lead to regressions if we don't actually need
all the tuples.

I mean, imagine we have TIDs

[T1, T2, T3, T4, T5, T6]

Maybe T1, T5, T6 are from the same page, so per your proposal we might
reorder and prefetch them in this order:

[T1, T5, T6, T2, T3, T4]

But maybe we only need [T1, T2] because of a LIMIT, and then the extra
work we did processing T5 and T6 is wasted.
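
For example, something like this (just reusing the btree_test table
from the quoted example above, purely as an illustration):

  -- the LIMIT means we may only ever need the first couple of TIDs in
  -- index order, so heap pages prefetched for the reordered T5, T6 may
  -- be wasted work
  select * from btree_test
   where a in (999, 1000, 1001, 1002)
   order by a
   limit 2;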

> This is conceptually a "mini bitmap index scan", though one that takes
> place "inside" a plain index scan, as it processes one particular leaf
> page. That's the kind of design that "plain index scan vs bitmap index
> scan as a continuum" leads me to (a little like the continuum between
> nested loop joins, block nested loop joins, and merge joins). I bet it
> would be practical to do things this way, and help a lot with some
> kinds of queries. It might even be simpler than avoiding excessive
> prefetching using an LRU cache thing.
>
> I'm talking about problems that exist today, without your patch.
>
> I'll show a concrete example of the kind of index/index scan that
> might be affected.
>
> Attached is an extract of the server log when the regression tests ran
> against a server patched to show custom instrumentation. The log
> output shows exactly what's going on with one particular nbtree
> opportunistic deletion (my point has nothing to do with deletion, but
> it happens to be convenient to make my point in this fashion). This
> specific example involves deletion of tuples from the system catalog
> index "pg_type_typname_nsp_index". There is nothing very atypical
> about it; it just shows a certain kind of heap fragmentation that's
> probably very common.
>
> Imagine a plain index scan involving a query along the lines of
> "select * from pg_type where typname like 'part%' ", or similar. This
> query runs an instant before the example LP_DEAD-bit-driven
> opportunistic deletion (a "simple deletion" in nbtree parlance) took
> place. You'll be able to piece together from the log output that there
> would only be about 4 heap blocks involved with such a query. Ideally,
> our hypothetical index scan would pin each buffer/heap page exactly
> once, for a total of 4 PinBuffer()/UnpinBuffer() calls. After all,
> we're talking about a fairly selective query here, that only needs to
> scan precisely one leaf page (I verified this part too) -- so why
> wouldn't we expect "index scan parity"?
>
> While there is significant clustering on this example leaf page/key
> space, heap TID is not *perfectly* correlated with the
> logical/keyspace order of the index -- which can have outsized
> consequences. Notice that some heap blocks are non-contiguous
> relative to logical/keyspace/index scan/index page offset number order.
>
> We'll end up pinning each of the 4 or so heap pages more than once
> (sometimes several times each), when in principle we could have pinned
> each heap page exactly once. In other words, there is way too much of
> a difference between the case where the tuples we scan are *almost*
> perfectly clustered (which is what you see in my example) and the case
> where they're exactly perfectly clustered. In other other words, there
> is way too much of a difference between plain index scan, and bitmap
> index scan.
>
> (What I'm saying here is only true because this is a composite index
> and our query uses "like", returning rows that match a prefix -- if our
> index was on the column "typname" alone and we used a simple equality
> condition in our query then the Postgres 12 nbtree work would be
> enough to avoid the extra PinBuffer()/UnpinBuffer() calls. I suspect
> that there are still relatively many important cases where we perform
> extra PinBuffer()/UnpinBuffer() calls during plain index scans that
> only touch one leaf page anyway.)
>
> Obviously we should expect bitmap index scans to have a natural
> advantage over plain index scans whenever there is little or no
> correlation -- that's clear. But that's not what we see here -- we're
> way too sensitive to minor imperfections in clustering that are
> naturally present on some kinds of leaf pages. The potential
> difference in pin/unpin traffic (relative to the bitmap index scan
> case) seems pathological to me. Ideally, we wouldn't have these kinds
> of differences at all. It's going to disrupt usage_count on the
> buffers.
>

I'm not sure I understand all the nuance here, but the thing I take away
is to add tests with different levels of correlation, and probably also
some multi-column indexes.
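
E.g. something along these lines (the table, the amount of "noise" in
the correlation and the index definition are just a sketch of what I
have in mind):

  -- mostly-correlated data: heap order roughly follows the key, with a
  -- bit of noise, so TIDs on a leaf page are almost (but not exactly)
  -- in heap order
  create table corr_test as
  select i + (random() * 100)::int as a,
         mod(i, 100) as b,
         md5(i::text) as c
    from generate_series(1, 1000000) s(i);

  create index on corr_test (a, b);

  select * from corr_test where a between 1000 and 2000 and b < 10;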

>>> It's important to carefully distinguish between cases where plain
>>> index scans really are at an inherent disadvantage relative to bitmap
>>> index scans (because there really is no getting around the need to
>>> access the same heap page many times with an index scan) versus cases
>>> that merely *appear* that way. Implementation restrictions that only
>>> really affect the plain index scan case (e.g., the lack of a
>>> reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
>>> should be accounted for when assessing the viability of index scan +
>>> prefetch over bitmap index scan + prefetch. This is very subtle, but
>>> important.
>>>
>>
>> I do agree, but what do you mean by "assessing"?
>
> I mean performance validation. There ought to be a theoretical model
> that describes the relationship between index scan and bitmap index
> scan, that has actual predictive power in the real world, across a
> variety of different cases. Something that isn't sensitive to the
> current phase of the moon (e.g., heap fragmentation along the lines of
> my pg_type_typname_nsp_index log output). I particularly want to avoid
> nasty discontinuities that really make no sense.
>
>> Wasn't the agreement at
>> the unconference session that we'd not tweak costing? So ultimately, this
>> does not really affect which scan type we pick. We'll keep doing the
>> same planning decisions as today, no?
>
> I'm not really talking about tweaking the costing. What I'm saying is
> that we really should expect index scans to behave similarly to bitmap
> index scans at runtime, for queries that really don't have much to
> gain from using a bitmap heap scan (queries that may or may not also
> benefit from prefetching). There are several reasons why this makes
> sense to me.
>
> One reason is that it makes tweaking the actual costing easier later
> on. Also, your point about plan robustness was a good one. If we make
> the wrong choice about index scan vs bitmap index scan, and the
> consequences aren't so bad, that's a very useful enhancement in
> itself.
>
> The most important reason of all may just be to build confidence in
> the design. I'm interested in understanding when and how prefetching
> stops helping.
>

Agreed.

>> I'm all for building a more comprehensive set of test cases - the stuff
>> presented at pgcon was good for demonstration, but it certainly is not
>> enough for testing. The SAOP queries are a great addition, I also plan
>> to run those queries on different (less random) data sets, etc. We'll
>> probably discover more interesting cases as the patch improves.
>
> Definitely.
>
>> There are two reasons why I think the AM is not the right place:
>>
>> - accessing table from index code seems backwards
>>
>> - we already do prefetching from the executor (nodeBitmapHeapscan.c)
>>
>> It feels kinda wrong in hindsight.
>
> I'm willing to accept that we should do it the way you've done it in
> the patch provisionally. It's complicated enough that it feels like I
> should reserve the right to change my mind.
>
>>>> I think this is an acceptable limitation, certainly for v0. Prefetching
>>>> across multiple leaf pages seems way more complex (particularly for the
>>>> cases using pairing heap), so let's leave this for the future.
>
>> Yeah, I'm not saying it's impossible, and imagined we might teach nbtree
>> to do that. But it seems like work for future someone.
>
> Right. You probably noticed that this is another case where we'd be
> making index scans behave more like bitmap index scans (perhaps even
> including the downsides for kill_prior_tuple that accompany not
> processing each leaf page inline). There is probably a point where
> that ceases to be sensible, but I don't know what that point is.
> They're way more similar than we seem to imagine.
>

OK. Thanks for all the comments.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
