Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, benoit <benoit(at)hopsandfork(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Date: 2024-01-22 21:12:55
Message-ID: CAEze2Wj7tSTfTy5W0EKRQ-GReYRFaBTbPPnDxXJ8TZttoiZ1kA@mail.gmail.com
Lists: pgsql-hackers

On Fri, 19 Jan 2024 at 23:42, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>

Thank you for your replies so far.

> On Thu, Jan 18, 2024 at 11:39 AM Matthias van de Meent
> <boekewurm+postgres(at)gmail(dot)com> wrote:
> > I would agree with you if this was about new APIs and features, but
> > here existing APIs are being repurposed without changing them. A
> > maintainer of index AMs would not look at the commit title 'Enhance
> > nbtree ScalarArrayOp execution' and think "oh, now I have to make sure
> > my existing amsearcharray+amcanorder handling actually supports
> > non-prefix arrays keys and return data in index order".
>
> This is getting ridiculous.
>
> It is quite likely that there are exactly zero affected out-of-core
> index AMs. I don't count pgroonga as a counterexample (I don't think
> that it actually fulfills the definition of a ). Basically,
> "amcanorder" index AMs more or less promise to be compatible with
> nbtree, down to having the same strategy numbers. So the idea that I'm
> going to upset amsearcharray+amcanorder index AM authors is a
> completely theoretical problem. The planner code evolved with nbtree,
> hand-in-glove.

And this is where I disagree with your (perceived) implicit argument
that this should be, and always stay, the case. I don't mind planner
changes for nbtree per se, but as I've mentioned before in other
places, I really don't like how we treat amcanorder as if it were
amisbtree. It's not called that, so we shouldn't expect that to remain
the case, and we definitely shouldn't keep building on the expectation
that the AM is always going to be nbtree when amcanorder is set (or
amsearcharray is set, or ..., or any combination of those flags that
btree currently uses). By keeping that expectation alive, this becomes
a self-fulfilling prophecy, and I really don't like such restrictive
self-fulfilling prophecies. It's nice that we have index AM feature
flags, but if we effectively allow only one implementation for this
set of flags by ignoring potential other users when changing behavior,
then what is the API good for (apart from abstraction layering, which
is nice)?

/rant

I'll see about the

> I'm more than happy to add a reminder to the commit message about
> adding a proforma listing to the compatibility section of the Postgres
> 17 release notes. Can we actually agree on which theoretical index AM
> types are affected first, though? Isn't it actually
> amsearcharray+amcanorder+amcanmulticol external index AMs only? Do I
> have that right?

I think that may be right, but I could very well have built a
btree-lite that doesn't have the historical baggage of having to deal
with pre-v12 (version 4) pages, plus some improvements that haven't
made it into core yet.

> BTW, how should we phrase this compatibility note, so that it can be
> understood? It's rather awkward.

Something like the following, maybe?

"""
Compatibility: The planner will now generate paths with array scan
keys on any column of an ordered index, rather than only on a prefix
of the index columns. The planner still expects fully ordered data
from those indexes.
Historically, the btree AM couldn't output correctly ordered data for
suffix array scan keys, which was "fixed" by prohibiting the planner
from generating array scan keys without an equality prefix scan key up
to that attribute. This commit fixes the underlying issue, and the
planner restriction has thus been removed, as the only in-core index
AM that reports amcanorder now supports array scan keys on any column,
regardless of which prefix scan keys it has.
"""

> > > As I said to Heikki, thinking about this some more is on my todo list.
> > > I mean the way that this worked had substantial revisions on HEAD
> > > right before I posted v9. v9 was only to fix the bit rot that that
> > > caused.
> >
> > Then I'll be waiting for that TODO item to be resolved.
>
> My TODO item is to resolve the question of whether and to what extent
> these two optimizations should be combined. It's possible that I'll
> decide that it isn't worth it, for whatever reason.

That's fine, this decision (and any work related to it) is exactly
what I was referring to with that mention of the TODO.

> > > Even single digit
> > > thousand of array elements should be considered huge. Plus this code
> > > path is only hit with poorly written queries.
> >
> > Would you accept suggestions?
>
> You mean that you want to have a go at it yourself? Sure, go ahead.
>
> I cannot promise that I'll accept your suggested revisions, but if
> they aren't too invasive/complicated compared to what I have now, then
> I will just accept them. I maintain that this isn't really necessary,
> but it might be the path of least resistance at this point.

Attached are 2 patches: v11.patch-a and v11.patch-b. Both are
incremental on top of your earlier patchset, and neither allocates
additional memory in the merge operation in non-assertion builds.

patch-a is a trivial and clean implementation of mergesort, which
tends to reduce the total number of compare operations if both arrays
are of similar size and value range. It writes data directly back into
the main array on non-assertion builds, and with assertions it reuses
your binary search join on scratch space for algorithm validation.
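For readers following along, the linear merge in patch-a might be
sketched roughly as below. This is an illustrative standalone sketch
using plain ints, not the actual _bt_merge_arrays code (which operates
on Datums via the opclass comparator); the function name and in-place
write-back convention are assumptions based on the description above.

```c
#include <stddef.h>

/*
 * Hypothetical sketch of a linear merge-intersection of two sorted,
 * duplicate-free arrays, writing the surviving elements back into
 * arr1 so that no extra memory is allocated.  Each element of either
 * array is visited at most once: O(n + m) comparisons.
 */
static size_t
merge_intersect(int *arr1, size_t n1, const int *arr2, size_t n2)
{
	size_t		i = 0,
				j = 0,
				nout = 0;

	while (i < n1 && j < n2)
	{
		if (arr1[i] < arr2[j])
			i++;				/* arr1 element has no match: skip it */
		else if (arr1[i] > arr2[j])
			j++;				/* arr2 element has no match: skip it */
		else
		{
			arr1[nout++] = arr1[i]; /* match: keep it, advance both */
			i++;
			j++;
		}
	}
	return nout;				/* number of surviving elements */
}
```

The point of the comparison with the binary-search join is that when
both arrays are of similar size and overlapping range, the linear
merge does fewer comparisons than repeatedly binary-searching one
array for each element of the other.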

patch-b is an improved version of patch-a with reduced time complexity
in some cases: It applies exponential search to reduce work done where
there are large runs of unmatched values in either array, and does
that while keeping O(n+m) worst-case complexity, now getting a
best-case O(log(n)) complexity.
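The exponential ("galloping") search that patch-b layers on top could
look something like this sketch (again with illustrative names and
plain ints, not the patch's actual code): double the step size until
the window brackets the search key, then binary-search within it, so a
run of d unmatched values is skipped in O(log d) comparisons.

```c
#include <stddef.h>

/*
 * Find the first index in sorted arr[lo..n) whose value is >= key.
 * Grows the probe window exponentially before falling back to a
 * binary search over the bracketed range, so long runs of unmatched
 * values cost O(log d) rather than O(d) comparisons.
 */
static size_t
gallop_lower_bound(const int *arr, size_t lo, size_t n, int key)
{
	size_t		step = 1;
	size_t		hi = lo;

	/* grow the window exponentially until it brackets key */
	while (hi < n && arr[hi] < key)
	{
		lo = hi + 1;
		hi += step;
		step *= 2;
	}
	if (hi > n)
		hi = n;

	/* standard binary search within [lo, hi) */
	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (arr[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}
```

Plugging this into the merge loop in place of the one-step advances
preserves the O(n+m) worst case while collapsing large unmatched runs,
which is where the best-case O(log(n)) figure comes from.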

> > > > I'm also no fan of the (tail) recursion. I would agree that this is
> > > > unlikely to consume a lot of stack, but it does consume stackspace
> > > > nonetheless, and I'd prefer if it was not done this way.
> > >
> > > I disagree. The amount of stack space it consumes in the worst case is fixed.
> >
> > Is it fixed? Without digging very deep into it, it looks like it can
> > iterate on the order of n_scankeys deep? The comment above does hint
> > on 2 iterations, but is not very clear about the conditions and why.
>
> The recursive call to _bt_advance_array_keys is needed to deal with
> unsatisfied-and-required inequalities that were not detected by an
> initial call to _bt_check_compare, following a second retry call to
> _bt_check_compare. We know that this recursive call to
> _bt_advance_array_keys cannot recurse again because we've
> already identified which specific inequality scan key it is that's a
> still-unsatisfied inequality, following an initial round of array key
> advancement.

So, it's a case of this?

scankey: a > 1 AND a = ANY (1 (=current), 2, 3) AND b < 4 AND b = ANY
(2 (=current), 3, 4)
tuple: a=2, b=4

We first match the 'a' inequality key, then find an array-key mismatch
and break out; we then advance the array keys to a=2 (of ANY (1, 2,
3)) and b=4 (of ANY (2, 3, 4)); and now we have to recheck the later
b < 4 inequality key, because that required inequality key wasn't
checked earlier: the array-key mismatch caused _bt_check_compare to
stop at the a=1 condition rather than go on to check the b < 4
condition?

If so, then this visual explanation helped me understand the point
(and why it can't repeat more than once) much better than all that
text. Maybe something like it can be integrated?

> This is a narrow edge-case -- it is an awkward one. Recursion does
> seem natural here, because we're essentially repeating the call to
> _bt_advance_array_keys from inside _bt_advance_array_keys, rather than
> leaving it up to the usual external caller to do later on. If you
> ripped this code out it would barely be noticeable at all. But it
> seems worth it so that we can make a uniform promise to always advance
> the array keys to the maximum extent possible, based on what the
> caller's tuple tells us about the progress of the scan.

Agreed on the awkwardness and recursion.

> > > Because in one case we follow the convention for scan keys, whereas
> > > so->orderProcs is accessed via an attribute number subscript.
> >
> > Okay, but how about this in _bt_merge_arrays?
> >
> > + Datum *elem = elems_orig + i;
> >
> > I'm not familiar with the scan key convention, as most other places
> > use reference+subscripting.
>
> I meant the convention used in code like _bt_check_compare (which is
> what we call _bt_checkkeys on HEAD, basically).
>
> Note that the _bt_merge_arrays code that you've highlighted isn't
> iterating through so->keyData[] -- it is iterating through the
> function caller's elements array, which actually come from
> so->arrayKeys[].

That is exactly why I started asking about the use of pointer
additions: _bt_merge_arrays is new code, and I can't think of any
other case of this style being used in new code, except maybe where
the surrounding code already has this style. The reason I first
highlighted _bt_update_keys_with_arraykeys is that it had a clear case
of 2 different styles in the same function.

> Like every other Postgres contributor, I do my best to follow the
> conventions established by existing code. Sometimes that leads to
> pretty awkward results, where CamelCase and underscore styles are
> closely mixed together, because it works out to be the most consistent
> way of doing it overall.

I'm slightly surprised by that: after this patch, I can't find any
code in nbtutils that wasn't touched by this patch and that uses + for
pointer offsets.
Either way, that's apparently the style for indexing into a ScanKey?
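To be clear, the disagreement is purely stylistic: in C the two
spellings denote the same element, as this trivial sketch
(illustrative names only) shows.

```c
/*
 * Pointer-offset style vs. array-subscript style: both expressions
 * yield the address of the same element, so the choice between
 * "base + i" and "&base[i]" is purely a matter of convention.
 */
static int *
by_offset(int *base, int i)
{
	return base + i;			/* pointer-arithmetic style */
}

static int *
by_subscript(int *base, int i)
{
	return &base[i];			/* subscripting style */
}
```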

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Attachment Content-Type Size
v11-0001-nbtree-merge-array-scankeys-s-arrays-more-effici.patch-b application/octet-stream 7.5 KB
v11-0001-nbtree-merge-array-scankeys-s-arrays-more-effici.patch-a application/octet-stream 4.1 KB
