Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, benoit <benoit(at)hopsandfork(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Date: 2023-12-09 18:38:57
Message-ID: CAH2-WzkXJnHOuaonxm4xGPXU4n8D6VS2kzdmSy929nmiYYXBeA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 20, 2023 at 6:52 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> It should be noted that the patch isn't strictly guaranteed to always
> read fewer index pages than master, for a given query plan and index.
> This is by design. Though the patch comes close, it's not quite a
> certainty. There are known cases where the patch reads the occasional
> extra page (relative to what master would have done under the same
> circumstances). These are cases where the implementation just cannot
> know for sure whether the next/sibling leaf page has key space covered
> by any of the scan's array keys (at least not in a way that seems
> practical). The implementation has simple heuristics that infer (a
> polite word for "make an educated guess") about what will be found on
> the next page. Theoretically we could be more conservative in how we
> go about this, but that seems like a bad idea to me. It's really easy
> to find cases where the maximally conservative approach loses by a
> lot, and really hard to show cases where it wins at all.

Attached is v8, which pretty much rips all of this stuff out.

I did have a point when I said that it made sense to be optimistic
about finding matches on the next page with respect to any truncated
-inf attributes in high keys, though, and v8 still does that much.
But there is no reason to go any further than that -- no reason to
*also* be optimistic that *untruncated* high key/finaltup attributes
that *aren't* exact matches for any of the scan's required array keys
will turn out to have exact matches once we move on to the next
sibling page.
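
To make the v8 rule concrete, here is a deliberately simplified sketch
of the decision we now make when judging the next leaf page by the
current page's high key. The struct and function names are made up for
illustration; this is not the patch's actual code:

/* Toy model of v8's "could the next leaf page have matches?" test */
#include <stdbool.h>

typedef struct ToyHighKeyAttr
{
    bool    truncated;      /* -inf attribute, from suffix truncation? */
    int     value;          /* attribute value, when not truncated */
} ToyHighKeyAttr;

static bool
arrays_could_match_on_next_page(const ToyHighKeyAttr *highkey,
                                const int *current_array_keys,
                                int nkeyatts)
{
    for (int attnum = 0; attnum < nkeyatts; attnum++)
    {
        /* -inf attribute: we can't rule anything out, stay optimistic */
        if (highkey[attnum].truncated)
            return true;

        /*
         * Untruncated attribute that exactly matches the scan's current
         * array key: matches may continue onto the next page, so keep
         * checking later attributes.
         */
        if (highkey[attnum].value == current_array_keys[attnum])
            continue;

        /*
         * Untruncated attribute that isn't an exact match: v7 would
         * sometimes speculate that the next page had matches anyway,
         * whereas v8 just gives up on the current page's sibling.
         */
        return false;
    }

    return true;
}

The only guess left in v8 is the "return true" for the -inf case;
everything else is decided by what's physically present in the high
key.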

I reached this conclusion when working on a fix for the low
cardinality index regression that Tomas' tests brought to my attention
[1]. I started out with the intention of just fixing that one case, in
a very targeted way, but quickly realized that it made almost no sense
to just limit myself to the low cardinality cases. Even with Tomas'
problematic low cardinality test cases, I saw untruncated high key
attributes that were "close" to matching tuples -- they just weren't
close enough (i.e. the matches weren't actually on the next leaf
page) for us to win, so v7 lost. Being almost correct again and
again, yet still losing again and again, is a good sign that certain
basic assumptions were faulty (at least when the test case is
realistic, which it was in this instance).

To be clear, and to repeat, even in v8 we'll still "make guesses"
about -inf truncated attributes. But it's a much more limited form of
guessing. If we didn't at least do the -inf thing, then backwards
scans would weirdly work better than forward scans in some cases -- a
particular concern with queries that have index quals for each of
multiple columns. I don't think that this remaining "speculative"
behavior needs to be discussed at much length in code comments,
though. That's why v8 is a great deal simpler than v7 was here. No
more huge comment block at the end of the new _bt_checkkeys.
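
In case it isn't obvious what I mean by a "-inf truncated attribute",
here's a toy illustration (a made-up struct, not the real pivot tuple
representation, which uses BTreeTupleGetNAtts and friends):

/* Toy model: a leaf page high key with possibly-truncated attributes */
typedef struct ToyHighKey
{
    int     natts_present;  /* key attributes physically stored */
    int     natts_index;    /* key attributes in the index definition */
} ToyHighKey;

/* Attribute numbers are 1-based, as in PostgreSQL itself */
static inline bool
highkey_attr_is_minus_inf(const ToyHighKey *hikey, int attnum)
{
    /* truncated trailing attributes compare as less than any real value */
    return attnum > hikey->natts_present && attnum <= hikey->natts_index;
}

Only pivot tuples (high keys and downlinks) are ever truncated like
this, which is why the problem is inherently asymmetric between
forward and backwards scans.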

Notable stuff that *hasn't* changed from v7:

I'm posting this v8 having not yet worked through all of Heikki's
feedback. In particular, v8 doesn't deal with the relatively hard
question of what to do about the optimizations added by Alexander
Korotkov's commit e0b1ee17 (should I keep them disabled, selectively
re-enable one or both optimizations, or something else?). This is
partly due to at least one of the optimizations having problems of
its own that are still outstanding [2]. I also wanted to get a
revision out before travelling to Prague for PGConf.EU, which will be
followed by other Christmas travel. That's likely to keep me away from
the patch for weeks (that, plus I'm moving to NYC in early January).
So I just ran out of time to do absolutely everything.

Other notable changes:

* Bug fixes for cases where the array state machine gets confused by a
change in the scan direction, plus similar cases involving mark and
restore processing.

I'm not entirely happy with my approach here (mostly referring to the
new code in _bt_steppage). It feels like it needs to be a little
better integrated with mark/restore processing.

No doubt Heikki will have his own doubts about this. I've included my
test cases for the issues in this area. The problems are really hard
to reproduce, and writing these tests took a surprisingly large amount
of effort. The tests might not be suitable for commit, but you really
need to see the test cases to be able to review the code efficiently.
It's just fiddly.
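
The invariant that these fixes protect is easier to see in a toy model
than in the real code (made-up types below, not the actual
BTArrayKeyInfo machinery): after the caller changes direction, the
scan's current array elements have to be rolled forwards or backwards
until they agree with the tuple the scan is now positioned on.

#include <stdbool.h>

/* Toy model of one "=ANY" array that must track the scan's position */
typedef struct ToyArrayKey
{
    const int  *elems;      /* sorted array elements */
    int         nelems;
    int         cur_elem;   /* element currently being matched against */
} ToyArrayKey;

/*
 * Advance cur_elem in the current scan direction until it can't get any
 * closer to the value at the scan's position.  Returns true iff the
 * selected element matches that value exactly.
 */
static bool
resync_array_key(ToyArrayKey *array, int tupvalue, bool forward)
{
    if (forward)
    {
        while (array->cur_elem < array->nelems - 1 &&
               array->elems[array->cur_elem] < tupvalue)
            array->cur_elem++;
    }
    else
    {
        while (array->cur_elem > 0 &&
               array->elems[array->cur_elem] > tupvalue)
            array->cur_elem--;
    }

    return array->elems[array->cur_elem] == tupvalue;
}

The bugs fixed here all boil down to the real equivalent of cur_elem
being left out of sync with the scan's position following a direction
change or a restore.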

* I've managed to make the array state machine just a little more
streamlined compared to v7.

Minor code polishing, not really worth describing in detail.

* Addressed concerns about incomplete opfamilies not working with the
patch by updating the error message within
_bt_preprocess_array_keys/_bt_sort_array_cmp_setup. It now exactly
matches the similar one in _bt_first.

I don't think that we need any new opr_sanity tests, since we already
have one for this. Making the error message match the one in _bt_first
ensures that anybody who runs into a problem here will see the same
error message that they'd have seen on earlier versions, anyway.

It's a more useful error compared to the one from v7 (in that it names
the index and its attribute directly). Plus it's good to be
consistent.
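
For reference, the lookup in question boils down to something like the
following. This is only a sketch, not the patch's actual function, and
the field usage is simplified:

/* Sketch: look up the cross-type ORDER proc for one array scan key */
#include "postgres.h"

#include "access/nbtree.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"

static Oid
lookup_array_order_proc(Relation rel, ScanKey arrayKey)
{
    AttrNumber  attno = arrayKey->sk_attno;
    Oid         opcintype = rel->rd_opcintype[attno - 1];
    Oid         elemtype = arrayKey->sk_subtype;    /* array element type */
    Oid         cmp_proc;

    cmp_proc = get_opfamily_proc(rel->rd_opfamily[attno - 1],
                                 opcintype, elemtype, BTORDER_PROC);

    if (!OidIsValid(cmp_proc))
        /* wording modeled on the corresponding error in _bt_first */
        elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
             BTORDER_PROC, opcintype, elemtype, attno,
             RelationGetRelationName(rel));

    return cmp_proc;
}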

I don't see any potential for the underlying _bt_sort_array_cmp_setup
behavior to be seen as a regression in our ability to cope with
incomplete opfamilies (compared to earlier Postgres versions).
Opfamilies that lack a cross-type ORDER proc were always very dicey
when combined with queries that use the corresponding cross-type =
operator. That situation isn't meaningfully different with the patch.

(Actually, this isn't 100% true in the case of queries + indexes with
non-required arrays specifically -- they'll need a 3-way comparator in
_bt_preprocess_array_keys/_bt_sort_array_cmp_setup, and yet *won't*
need one moments later, in _bt_first. This is because non-required
scan keys don't end up in _bt_first's insertion scan key at all, in
general. This distinction just seems pedantic, though. We're talking
about a case where things accidentally failed to fail in previous
versions, for some queries but not most queries. Now it'll fail in
exactly the same way in slightly more cases. In reality, affected
opfamilies are practically non-existent, so this is a hypothetical
upon a hypothetical.)
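
To illustrate that last point, a toy sketch (made-up types, nothing
like the real _bt_first code): only keys that were marked required get
to contribute to the keys used to position the initial descent of the
tree, so a non-required array never needs an ORDER proc at that stage.

#include <stdbool.h>

typedef struct ToyKey
{
    int     attno;
    bool    required;   /* required in the current scan direction? */
    int     value;
} ToyKey;

/* Select the keys used to find the scan's starting position */
static int
build_start_keys(const ToyKey *keys, int nkeys, const ToyKey **startkeys)
{
    int     nstart = 0;

    for (int i = 0; i < nkeys; i++)
    {
        /*
         * Non-required keys (including non-required arrays) are only
         * applied later, as filters, by _bt_checkkeys.
         */
        if (!keys[i].required)
            continue;
        startkeys[nstart++] = &keys[i];
    }

    return nstart;
}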

[1] https://postgr.es/m/CAH2-WzmtV7XEWxf_rP1pw=vyDjGLi__zGOy6Me5MovR3e1kfdg@mail.gmail.com
[2] https://postgr.es/m/CAH2-Wzn0LeLcb1PdBnK0xisz8NpHkxRrMr3NWJ+KOK-WZ+QtTQ@mail.gmail.com
--
Peter Geoghegan

Attachment Content-Type Size
v8-0001-Enhance-nbtree-ScalarArrayOp-execution.patch application/octet-stream 162.9 KB
