Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, benoit <benoit(at)hopsandfork(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Date: 2024-03-02 01:30:15
Message-ID: CAH2-Wzm=kTenodJ2Xc3+p4Kp+0AC5MGR7kQGe0fGOXJOJO0bCQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 15, 2024 at 6:36 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v11, which now says something like that in the commit
> message.

Attached is v12.

> All that really remains now is to research how we might integrate this
> work with the recently added continuescanPrechecked/haveFirstMatch
> stuff from Alexander Korotkov, if at all.

The main change in v12 is that I've integrated both the
continuescanPrechecked and the haveFirstMatch optimizations. Both of
these fields are now page-level state, shared between the _bt_readpage
caller and the _bt_checkkeys/_bt_advance_array_keys callees (so they
appear next to the new home for _bt_checkkeys' continuescan field, in
the new page state struct).
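A rough sketch of what such shared page-level state might look like (the struct and field names here are only illustrative guesses, not taken from the patch):

```c
#include <stdbool.h>

/*
 * Illustrative page-level state shared between a _bt_readpage-style
 * caller and its _bt_checkkeys/_bt_advance_array_keys callees.
 * Hypothetical names, for explanation only.
 */
typedef struct BTPageReadState
{
    bool continuescan;           /* ok to continue the scan past this page? */
    bool continuescanPrechecked; /* precheck showed that all tuples on the
                                  * page pass the required-in-same-direction
                                  * scan keys */
    bool haveFirstMatch;         /* found first match for the current set of
                                  * array keys on this page */
} BTPageReadState;

/* Reset per-page state at the start of reading each leaf page */
static void
reset_page_state(BTPageReadState *pstate)
{
    pstate->continuescan = true;
    pstate->continuescanPrechecked = false;
    pstate->haveFirstMatch = false;
}

/*
 * Invalidate the first-match flag whenever the array keys advance, so
 * that there can be more than one "first match" per page
 */
static void
on_array_keys_advanced(BTPageReadState *pstate)
{
    pstate->haveFirstMatch = false;
}
```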

The haveFirstMatch optimization (not a great name BTW) was the easier
of the two to integrate. I just invalidate the flag (set it to false)
when the array keys advance. It works in exactly the same way as in
the simple no-arrays case, except that there can be more than one
"first match" per page. (This approach was directly enabled by the new
design that came out of Alexander's bugfix commit 7e6fb5da)

The continuescanPrechecked optimization was trickier. The hard part
was avoiding confusion when the first match for the current set of
array keys falls somewhere in the middle of the page -- that needs to
be a gating condition on applying the optimization, applied within
_bt_readpage.

continuescanPrechecked
======================

To recap, this precheck optimization is predicated on the idea that
the precheck _bt_checkkeys call tells us all we need to know about
required-in-same-direction scan keys for every other tuple on the page
(assuming continuescan=true is set during the precheck). Critically,
this assumes that there can be no confusion between "before the start
of matching tuples" and "after the end of matching tuples" (in the
presence of required equality strategy scan keys). In other words, it
tacitly assumes that we can't break a very old rule that says that
_bt_first must never call _bt_readpage with an offnum that's before
the start of the first match (the leaf page position located by our
insertion scan key). Although it isn't obvious that this is relevant
at all right now (partly because the _bt_first call to _bt_readpage
won't even use the optimization on the grounds that to do so would
regress point queries), it becomes more obvious once my patch is added
to the mix.

To recap again, making the arrays advance dynamically necessarily
requires that I teach _bt_checkkeys to avoid confusing "before the
start of matching tuples" with "after the end of matching tuples" --
we must give _bt_checkkeys a set of 3-way ORDER procs (and the
necessary context) to avoid that confusion -- even for any non-array
required equality keys.
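The difference between a boolean "satisfies the key?" answer and a 3-way ORDER proc style answer can be sketched like this (hypothetical names and a plain int comparison standing in for the opclass's ORDER support function):

```c
/*
 * Outcome of checking a tuple attribute against a required equality or
 * array key during a forward scan.  Hypothetical enum, for illustration.
 */
typedef enum
{
    TUPLE_BEFORE_MATCHES,       /* tuple < key: start of matches is ahead */
    TUPLE_MATCHES,              /* tuple == key */
    TUPLE_AFTER_MATCHES         /* tuple > key: past the end of matches */
} KeyCheckResult;

/*
 * A boolean "satisfied?" check cannot distinguish "before the start of
 * matching tuples" from "after the end of matching tuples" -- both just
 * report "not satisfied".  A 3-way comparison can tell them apart, which
 * is what lets the array keys advance dynamically without confusion.
 */
static KeyCheckResult
check_required_eq_key(int tuple_attr, int key_value)
{
    if (tuple_attr < key_value)
        return TUPLE_BEFORE_MATCHES;
    if (tuple_attr > key_value)
        return TUPLE_AFTER_MATCHES;
    return TUPLE_MATCHES;
}
```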

Putting it all together (having laid the groundwork with my recaps):
This means that it is only safe (for my patch) to even attempt the
precheck optimization when we already know that the array keys cover
the keyspace at the start of the page (and if the attempt actually
succeeds, it succeeds because the current array keys cover the entire
page). And so in v12 I track that state in so->scanBehind. In
practice, we only rarely need to avoid the continuescanPrechecked
optimization as a result of this gating condition -- it hardly reduces
the effectiveness of the optimization at all. In fact, it isn't even
possible for so->scanBehind to ever be set during a backward scan.

Saving cycles in _bt_checkkeys with so->scanBehind
--------------------------------------------------

It turns out that this so->scanBehind business is generally useful. We
now expect _bt_advance_array_keys to establish whether or not the
_bt_readpage-wise scan will continue to the end of the current leaf
page up-front, each time the arrays are advanced (advanced past the
tuple being scanned by _bt_readpage). When _bt_advance_array_keys
decides to move onto the next page, it might also set so->scanBehind.
This structure allowed me to simplify the hot code paths within
_bt_checkkeys. Now _bt_checkkeys doesn't need to check the page high
key (or check the first non-pivot tuple in the case of backwards
scans) at all in the very common case where so->scanBehind hasn't been
set by _bt_advance_array_keys.

I mentioned that only forward scans can ever set the so->scanBehind
flag variable. That's because only forward scans can advance the array
keys using the page high key, which (due to suffix truncation) creates
the possibility that the next _bt_readpage will start out on a page
whose earlier tuples are before the real start of matches (for the
current array keys) -- see the comments in _bt_advance_array_keys for
the full explanation.
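One way to picture the high key case: a truncated suffix attribute is optimistically deemed to "satisfy" the corresponding key, but the doubt is remembered in a scan-behind flag. A toy sketch (hypothetical names and a sentinel standing in for how real nbtree code marks truncation):

```c
#include <stdbool.h>

#define ATTR_TRUNCATED (-1)     /* sentinel for a suffix-truncated high key
                                 * attribute (illustrative only; real code
                                 * marks truncation via the tuple header) */

/*
 * Does the (possibly suffix-truncated) high key "satisfy" the scan's
 * required equality/array keys?  A truncated attribute is optimistically
 * treated as a match, but *scan_behind is set so that a later check knows
 * to verify the choice once the scan is on the next page.
 */
static bool
highkey_satisfies_keys(const int *highkey_attrs, const int *key_values,
                       int nkeys, bool *scan_behind)
{
    *scan_behind = false;
    for (int i = 0; i < nkeys; i++)
    {
        if (highkey_attrs[i] == ATTR_TRUNCATED)
        {
            /* truncated (-inf) attribute: deemed a match, remember doubt */
            *scan_behind = true;
            return true;
        }
        if (highkey_attrs[i] != key_values[i])
            return false;       /* definite mismatch: no doubt involved */
    }
    return true;
}
```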

so->scanBehind summary
----------------------

In summary, so->scanBehind serves two quite distinct purposes:

1. Makes it safe to apply the continuescanPrechecked optimization
during scans that happen to use array keys -- with hardly any changes
required to _bt_readpage to make this work (just testing the
so->scanBehind flag).

2. Gives _bt_checkkeys() a way to know when to recheck an earlier
speculative choice (by _bt_advance_array_keys) to move on to the
current leaf page before it was 100% clear that the page's key space
has exact matches for all of the scan's equality/array keys (which is
only possible when suffix truncation obscures the issue, hence the
thing about so->scanBehind only ever being set during forward scans).

This speculative choice can only be made during forward scans of a
composite index whose high key has one or more truncated suffix
attributes that correspond to a required scan key.
_bt_advance_array_keys deems that the truncated attribute "satisfies"
the scan key, but remembers that it wasn't strictly speaking satisfied
by setting so->scanBehind. (It is "satisfied" enough to treat the
truncated attribute as a match, but not so satisfied that it's willing
to do that without also giving _bt_checkkeys a way to back out, should
the choice turn out to have been wrong once the scan is on the next
page.)

_bt_preprocess_keys contradictory key array advancement
=======================================================

The other big change in v12 concerns _bt_preprocess_keys (not to be
confused with _bt_preprocess_array_keys). It has been taught to treat
the scan's array keys as a distinct type of scan key for the purposes
of detecting redundant and contradictory scan keys. In other words, it
treats scan keys with arrays as their own special type of equality
scan key -- it doesn't treat them as just another type of equality
strategy key in v12. Put another way, _bt_preprocess_keys no longer
"works at the level of individual primitive index scans".

Technically this is an optimization that targets cases with many
distinct sets of array keys that have contradictory quals. But I
mostly did this because it has what I now consider to be far better
code structure than what we had in previous versions (I changed my
mind, having previously considered the changes I'd made to
_bt_preprocess_keys for arrays to be less than desirable). And because
it defensively makes the B-Tree code tolerant of scans that have an
absurdly large number of distinct sets of array keys (e.g., 3 arrays
with 1,000 elements, which gives us a billion distinct sets of array
keys in total).
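For reference, the "billion distinct sets" figure is just the product of the array lengths, which this trivial helper (illustrative, not from the patch) makes concrete:

```c
/*
 * Number of distinct sets of array keys is the product of the array
 * lengths: 3 arrays of 1,000 elements each yield 1,000^3 = one billion
 * combinations that a naive per-primitive-scan approach would have to
 * preprocess one at a time.
 */
static long long
n_array_key_sets(const int *array_lengths, int narrays)
{
    long long total = 1;

    for (int i = 0; i < narrays; i++)
        total *= array_lengths[i];
    return total;
}
```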

In v12 an index scan can use (say) a billion distinct sets of array
keys, and still expect optimal performance -- even in the corner case
where (for whatever reason) the scan's quals also happen to be
contradictory. There is no longer any risk of calling
_bt_preprocess_keys a billion times, making the query time explode,
all because somebody added an "a = 5 AND a = 6" to the end of the
query's qual. In fact, there isn't ever a need to call
_bt_preprocess_keys() more than once, no matter the details of the
scan -- not now that _bt_preprocess_keys literally operates on whole
arrays as a separate thing to other equality keys. Once you start to
think in terms of array scan keys (not primitive index scans), it
becomes obvious that a single call to _bt_preprocess_keys is all you
really need.

(Note: While we actually do still always call _bt_preprocess_keys from
_bt_first, every call to _bt_preprocess_keys after the first one can
now simply reuse the existing scan keys that were output during the
first call. So effectively there's only one call to
_bt_preprocess_keys required per top-level scan.)

Potential downsides
-------------------

Admittedly, this new approach in _bt_preprocess_keys has some
potential downsides. _bt_preprocess_keys is now largely incapable of
treating quals like "WHERE a IN(1,2,3) AND a = 2" as "partially
contradictory". What it'll actually do is allow the qual to contain
duplicative scan keys (just like when we can't detect redundancy due
to a lack of cross-type support), and leave it all up to
_bt_checkkeys/_bt_advance_array_keys to figure it out later on. (At
one point I actually taught _bt_preprocess_keys to perform
"incremental advancement" of the scan's arrays itself, which worked,
but that still left the code vulnerable to having to call
_bt_preprocess_keys an enormous number of times in extreme cases.
That was deemed unacceptable, because it still seemed like a POLA
violation.)

This business with just leaving in some extra scan keys might seem a
little bit sloppy. I thought so myself, for a while, but now I think
that it's actually fine. For the following reasons:

* Even with arrays, contradictory quals like "WHERE a = 4 AND a = 5"
are still detected as contradictory, while quals like "WHERE a = 5 AND
a = 5" are still detected as containing a redundancy. We'll only "fail
to detect redundancy/contradictoriness" with cases such as "WHERE a IN
(1, 2, 3) AND a = 2" -- cases that (once you buy into this new way of
thinking about arrays and preprocessing) aren't really
contradictory/redundant at all.

* We have _bt_preprocess_array_keys, which was first taught to merge
together redundant array keys in an earlier version of this patch. So,
for example, quals like "WHERE a IN (1, 2, 3) AND a IN (3, 4, 5)" can
be preprocessed into "WHERE a IN (3)". Plus quals like "WHERE a IN
(1, 2, 3) AND a IN (18, 72)" can be ruled contradictory there and
then, in _bt_preprocess_array_keys, ending the scan immediately.

These cases seem like the really important ones for array
redundancy/contradictoriness. We're not going to miss these array
redundancies.

* The new approach to array key advancement is very good at quickly
eliminating redundant subsets of the array keys that couldn't be
detected as such by preprocessing, due to a lack of suitable
infrastructure. While there is some downside from allowing
preprocessing to fail to detect "partially contradictory" quals, the
new code is so good at efficiently advancing the array keys at runtime
that the added overhead is hardly noticeable at all. We're talking maybe
one or two extra descents of the index, for just a subset of all badly
written queries that show some kind of redundancy/contradictoriness
involving array scan keys.
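The merging that _bt_preprocess_array_keys performs on redundant IN lists amounts to intersecting sorted arrays. A self-contained sketch (hypothetical function, not the patch's actual code, with plain ints standing in for Datums):

```c
#include <stddef.h>

/*
 * Merge two sorted, duplicate-free IN-list style arrays into their
 * intersection, returning the number of surviving elements.  Zero
 * survivors means the quals are contradictory, so the scan can end
 * immediately.  Illustrative only.
 */
static size_t
intersect_array_keys(const int *a, size_t na,
                     const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, n = 0;

    while (i < na && j < nb)
    {
        if (a[i] < b[j])
            i++;                /* a[i] has no partner in b */
        else if (a[i] > b[j])
            j++;                /* b[j] has no partner in a */
        else
        {
            out[n++] = a[i];    /* common element survives */
            i++;
            j++;
        }
    }
    return n;
}
```

For instance, {1, 2, 3} ∩ {3, 4, 5} leaves just {3}, while {1, 2, 3} ∩ {18, 72} leaves nothing, i.e. a contradictory qual.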

Even if somebody takes the position that this approach isn't
acceptable (not that I expect that anybody will), then the problem
won't be that I've gone too far. The problem will just be that I
haven't gone far enough. If it somehow becomes a blocker to commit,
I'd rather handle the problem by compensating for the minor downsides
that the v12 _bt_preprocess_keys changes arguably create, by adding
more types of array-specific preprocessing to
_bt_preprocess_array_keys. You know, preprocessing that can actually
eliminate a subset of array keys given a qual like "WHERE a IN (1, 2,
3) AND a < 2".

To be clear, I really don't think it's worth adding new types of array
preprocessing to _bt_preprocess_array_keys, just for these cases --
my current approach works just fine, by any relevant metric (even if
you assume that the sorts of array redundancy/contradictoriness we can
miss out on are common and important, it's small potatoes). Just
bringing the idea of adding more stuff to _bt_preprocess_array_keys to
the attention of the group, as an option (the idea might also provide
useful context that makes my high-level thinking on this a bit easier
to follow).

--
Peter Geoghegan

Attachment Content-Type Size
v12-0001-Enhance-nbtree-ScalarArrayOp-execution.patch application/octet-stream 157.5 KB
