Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, benoit <benoit(at)hopsandfork(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Date: 2024-03-16 00:12:02
Message-ID: CAH2-WzkUaZesPygO4DCv4F7QG3BpMMfDmB42S+C_KPn-jaa+yA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 8, 2024 at 9:00 AM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> I've attached v14, where 0001 is v13, 0002 is a patch with small
> changes + some large comment suggestions, and 0003 which contains
> sorted merge join code for _bt_merge_arrays.

I have accepted your changes from 0003. Agree that it's better that
way. It's at least a little faster, but not meaningfully more
complicated.

This is part of my next revision, v15, which I've attached (along with
a test case that you might find useful, explained below).

> I'll try to work a bit on v13/14's _bt_preprocess_keys, and see what I
> can make of it.

That's been the big focus of this new v15, which now goes all out with
teaching _bt_preprocess_keys how to deal with arrays. This version does
comprehensive normalization of even very complicated combinations of
array and non-array scan keys.

For example, consider this query:

select *
from multi_test
where
a = any('{1, 2, 3, 4, 5}'::int[])
and
a > 2::bigint
and
a = any('{2, 3, 4, 5, 6}'::bigint[])
and
a < 6::smallint
and
a = any('{2, 3, 4, 5, 6}'::smallint[])
and
a < 4::int;

This has a total of 6 input scan keys -- 3 of which are arrays. But by
the time v15's _bt_preprocess_keys is done with it, it'll have only 1
scan key -- which doesn't even have an array (not anymore). The three
arrays intersect to {2, 3, 4, 5}, and the inequalities a > 2 and a < 4
then leave only 3. And so we won't even need to advance the array keys
one single time -- there'll simply be no array left to advance. In
other words, it'll be just as if the query was written this way from
the start:

select *
from multi_test
where
a = 3::int;

(Though of course the original query will spend more cycles on
preprocessing, compared to this manually simplified variant.)

In general, preprocessing can now simplify queries like this to the
maximum extent possible (without bringing the optimizer into it), no
matter how much crud of this kind is added -- even including
adversarial cases, with massive arrays that have some amount of
redundancy/contradictoriness to them.

It turned out to not be terribly difficult to teach
_bt_preprocess_keys everything it could possibly need to know about
arrays, so that it can operate on them directly, as a variant of the
standard equality strategy (we do still need _bt_preprocess_array_keys
for basic preprocessing of arrays, mostly just merging them). This is
better overall (in that it gets every little subtlety right), and it
also simplified a number of related issues. For example, there is no
longer any need to maintain a mapping between so->keyData[]-wise scan
keys (output scan keys) and scan->keyData[]-wise scan keys (input
scan keys). We can just add a step at the end of _bt_preprocess_keys
that fixes up the references, to make life easier within
_bt_advance_array_keys.
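
To make the "mostly just merging them" part concrete, here's a
minimal, self-contained sketch of the kind of sorted merge involved --
plain C ints only, so none of the Datum/ORDER proc/cross-type
machinery that the real _bt_merge_arrays code has to deal with.
Merging two equality arrays just means intersecting them, since a
matching row has to satisfy both quals (this mirrors the sorted-merge
approach from 0003 mentioned above):

/*
 * Intersect two sorted, duplicate-free arrays in a single merge pass.
 * Illustrative only -- not the actual _bt_merge_arrays code.
 */
static int
intersect_sorted_arrays(const int *a, int na, const int *b, int nb,
                        int *out)
{
    int     i = 0,
            j = 0,
            nout = 0;

    while (i < na && j < nb)
    {
        if (a[i] < b[j])
            i++;                    /* a[i] can't be in both arrays */
        else if (a[i] > b[j])
            j++;                    /* b[j] can't be in both arrays */
        else
        {
            out[nout++] = a[i];     /* element survives the merge */
            i++;
            j++;
        }
    }

    return nout;                    /* number of shared elements */
}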

This preprocessing work should all be happening during planning, not
during query execution -- that's the design that makes the most sense.
This is something we've discussed in the past in the context of skip
scan (see my original email to this thread for the reference). It
would be especially useful for the very fancy kinds of preprocessing
that are described by the MDAM paper, like using an index scan for a
NOT IN() list/array (this can actually make sense with a low
cardinality index).

The structure for preprocessing that I'm working towards (especially
in v15) sets the groundwork for making those shifts in the planner,
because we'll no longer treat each array constant as its own primitive
index scan during preprocessing. Right now, on HEAD, preprocessing
with arrays kinda treats each array constant like the parameter of an
imaginary inner index scan, from an imaginary nested loop join. But
the planner really needs to operate on the whole qual, all at once,
including any arrays. An actual nestloop join's inner index scan
naturally cannot do that, and so might still require runtime/dynamic
preprocessing in a world where that mostly happens in the planner --
but that's clearly not appropriate for arrays ("WHERE a = 5" and
"WHERE a IN (4, 5)" are almost the same thing, and so should be
handled in almost the same way by preprocessing).

> > +_bt_rewind_nonrequired_arrays(IndexScanDesc scan, ScanDirection dir)
> [...]
> > + if (!(cur->sk_flags & SK_SEARCHARRAY) &&
> > + cur->sk_strategy != BTEqualStrategyNumber)
> > + continue;
>
> This should use ||, not &&

Fixed. Yeah, that was a bug.

> > + if (readpagetup || result != 0)
> > + {
> > + Assert(result != 0);
> > + return false;
> > + }
>
> I'm confused about this. By asserting !readpagetup after this exit, we
> could save a branch condition for the !readpagetup result != 0 path.
> Can't we better assert the inverse just below, or is this specifically
> for defense-in-depth against bug? E.g.
>
> + if (result != 0)
> + return false;
> +
> + Assert(!readpagetup);

Yeah, that's what it's there for -- defense. It seems slightly better
as-is, because you'll get an assertion failure if a "readpagetup"
caller gets "result == 0". That's never supposed to happen (if it did,
then our ORDER proc wouldn't be in agreement with what our = operator
indicated about the same tuple attribute value moments earlier, inside
_bt_check_compare).

> > + /*
> > + * By here we have established that the scan's required arrays were
> > + * advanced, and that they haven't become exhausted.
> > + */
> > + Assert(arrays_advanced || !arrays_exhausted);
>
> Should use &&, based on the comment.

Fixed by getting rid of the arrays_exhausted variable, which wasn't
adding much anyway.

> > + * We generally permit primitive index scans to continue onto the next
> > + * sibling page when the page's finaltup satisfies all required scan keys
> > + * at the point where we're between pages.
>
> This should probably describe that we permit primitive scans with
> array keys to continue until we get to the sibling page, rather than
> this rather obvious and generic statement that would cover even the
> index scan for id > 0 ORDER BY id asc; or this paragraph can be
> removed.

It's not quite obvious. The scan's array keys change as the scan makes
progress, up to once per tuple read. But even if it really were
obvious, it's supposed to frame the later discussion -- discussion of
those less common cases where this isn't what happens. These
exceptions are:

1. When a required scan key is deemed "satisfied" only because its
value was truncated in a high key finaltup. (Technically this is what
_bt_checkkeys has always done, but we're much more sensitive to this
stuff now, because we won't necessarily get to make another choice
about starting a new primitive index scan for a long time.)

2. When we apply the has_required_opposite_direction_only stuff, and
decide to start a new primitive index scan, even though technically
all of our required-in-this-direction scan keys are still satisfied.

Separately, there is also the potential for undesirable interactions
between 1 and 2, which is why we don't let them mix. (We have the "if
(so->scanBehind && has_required_opposite_direction_only) goto
new_prim_scan" gating condition.)

> Further notes:
>
> I have yet to fully grasp what so->scanBehind is supposed to mean. "/*
> Scan might be behind arrays? */" doesn't give me enough hints here.

Yes, it is complicated. The best explanation is the one I've added to
_bt_readpage, next to the precheck. But that does need more work.

Note that the so->scanBehind thing solves two distinct problems for
the patch (related, but still clearly distinct). These problems are:

1. It makes the existing precheck/continuescan optimization in
_bt_readpage safe -- we'd sometimes get wrong answers to queries if we
didn't limit application of the optimization to !so->scanBehind cases.

I have a test case that proves that this is true -- the one I
mentioned in my introduction. I'm attaching that as
precheck_testcase.sql now. It might help you to understand
so->scanBehind, particularly this point 1 about the basic correctness
of the precheck thing (possibly point 2 also).

2. It is used to speculatively visit the next leaf page in corner
cases where truncated -inf attributes from the high key are deemed
"satisfied".

Generally speaking, we don't want to visit the next leaf page unless
we're already 100% sure that it might have matches (if any page has
matches for our current array keys at all, that is). But in a certain
sense we're only guessing. It isn't guaranteed (and fundamentally
can't be guaranteed) to work out once we're on the next sibling page.
We've merely assumed that our array keys satisfy truncated -inf
columns, without really knowing what we'll find on the next page when
we actually visit it (we're not going to see -inf in the non-pivot
tuples on the next page, we'll see some real/non-sentinel low-sorting
value).

We have good practical reasons to not want to treat them as
non-matching (though that would be correct), and to take this limited
gamble (we can discuss that some more if you'd like). Once you accept
that we have to do this, it follows that we need to be prepared to:

A. Notice that we're really just guessing in the sense I've described
(before leaving for the next sibling leaf page), by setting
so->scanBehind. We'll remember that it happened.

and:

B. Notice, once we're on the next/sibling leaf page, that the limited
gamble didn't pay off, so that we can cut our losses and start a new
primitive index scan at that point. We do this by checking
so->scanBehind (along with the sibling page's high key), once on the
sibling page. (We don't need explicit handling for the case when it
works out, as it almost always will.)
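
Putting A and B together, here's a toy decision function sketching the
logic (purely illustrative -- only the scanBehind flag corresponds to
anything real in the patch, everything else is a stand-in for the
checks described above):

#include <stdbool.h>

/*
 * Toy model of the decision made on the sibling page.  "scanBehind" is
 * the flag set in step A; "sibling_high_key_matches" stands in for
 * testing the sibling page's high key against the current array keys.
 */
static bool
continue_reading_sibling_page(bool scanBehind,
                              bool sibling_high_key_matches)
{
    /*
     * B: we left the previous page merely guessing, and the sibling
     * page's high key tells us the guess didn't pay off -- cut our
     * losses and start a new primitive index scan instead.
     */
    if (scanBehind && !sibling_high_key_matches)
        return false;

    /* Otherwise just keep reading the page, as we almost always will */
    return true;
}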

If you want to include discussion of problem 1 here too (not just
problem 2), then I should add a third thing that we need to notice:

C. Notice that we can't do the precheck thing once in _bt_readpage,
because it'd be wrong to allow it.

> I find it weird that we call _bt_advance_array_keys for non-required
> sktrig. Shouldn't that be as easy as doing a binary search through the
> array? Why does this need to hit the difficult path?

What difficult path?

Advancing non-required arrays isn't that different from advancing
required ones. We will never advance required arrays when called just
to advance a non-required one, obviously. But we can do the opposite.
In fact, we absolutely have to advance non-required arrays (as best we
can) when advancing a required one (or when the call was triggered by
a non-array required scan key).

That said, it could have been clearer than it was in earlier versions.
v15 makes the difference between the non-required scan key trigger and
required scan key trigger cases clearer within _bt_advance_array_keys.
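
The search itself really is simple, for required and non-required
arrays alike. Here's a minimal, self-contained sketch of the kind of
binary search being alluded to -- forward scans and plain ints only,
without the ORDER proc, cross-type, and backwards-scan handling that
the real code needs:

/*
 * Find the first array element >= tupval (forward scan).  Returns
 * nelems when every element sorts before tupval, i.e. the array has
 * been exhausted in this direction.  Illustrative only.
 */
static int
advance_array_binsrch(const int *elems, int nelems, int tupval)
{
    int     lo = 0;
    int     hi = nelems;

    while (lo < hi)
    {
        int     mid = lo + (hi - lo) / 2;

        if (elems[mid] < tupval)
            lo = mid + 1;
        else
            hi = mid;
    }

    return lo;
}

The care goes into the rules around the search -- which arrays are
allowed to advance in response to which trigger -- rather than into
the search itself.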

--
Peter Geoghegan

Attachment Content-Type Size
precheck_testcase.sql application/octet-stream 2.1 KB
v15-0001-Enhance-nbtree-ScalarArrayOp-execution.patch application/octet-stream 176.1 KB
