Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, benoit <benoit(at)hopsandfork(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Date: 2023-11-21 02:52:32
Message-ID: CAH2-WzmTHoCsOmSgLg=yyft9LoERtuCKXyG2GZn+28PzonFA_g@mail.gmail.com
Lists: pgsql-hackers

On Sat, Nov 11, 2023 at 1:08 PM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> Thanks. Here's my review of the btree-related code:

Attached is v7.

The main focus in v7 is making the handling of required
inequality-strategy scan keys more explicit -- now there is an
understanding of inequalities shared by _bt_check_compare (the
function that becomes the guts of _bt_checkkeys) and the new
_bt_advance_array_keys function/state machine. The big idea for v7 is
to generalize how we handle required equality-strategy scan keys
(always required in both scan directions), extending the same concept
to deal with required inequality-strategy scan keys (only ever
required in one direction, which may or may not be the scan
direction).
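(In existing nbtree terms: preprocessing marks an equality key with
both SK_BT_REQFWD and SK_BT_REQBKWD, while an inequality gets only one
of the two. A sketch of just that distinction, ignoring the rule that
a key is only marked required when all earlier index columns have
equality constraints:)

    /* sketch of direction-dependent requiredness (existing nbtree flags) */
    if (key->sk_strategy == BTEqualStrategyNumber)
        key->sk_flags |= (SK_BT_REQFWD | SK_BT_REQBKWD);
    else if (key->sk_strategy == BTLessStrategyNumber ||
             key->sk_strategy == BTLessEqualStrategyNumber)
        key->sk_flags |= SK_BT_REQFWD;      /* terminates forward scans */
    else
        key->sk_flags |= SK_BT_REQBKWD;     /* terminates backward scans */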

This led to my discovering and fixing a couple of bugs related to
inequality handling. These issues were of the same general character
as many others I've dealt with before now: they involved subtle
confusion about when and how to start another primitive index scan,
leading to the scan reading many more pages than strictly necessary
(potentially many more than master). In other words, these were cases
where we didn't give up and start another primitive index scan, even
though (once you look at a repro of the issue) continuing the current
one was obviously not sensible -- an accidental full index scan.

While I'm still not completely happy with the way that inequalities
are handled, things in this area are much improved in v7.

It should be noted that the patch isn't strictly guaranteed to always
read fewer index pages than master, for a given query plan and index.
This is by design. Though the patch comes close, it's not quite a
certainty. There are known cases where the patch reads the occasional
extra page (relative to what master would have done under the same
circumstances). These are cases where the implementation just cannot
know for sure whether the next/sibling leaf page has key space covered
by any of the scan's array keys (at least not in a way that seems
practical). The implementation has simple heuristics that infer (a
polite word for "guess") what will be found on the next page.
Theoretically we could be more conservative in how we
go about this, but that seems like a bad idea to me. It's really easy
to find cases where the maximally conservative approach loses by a
lot, and really hard to show cases where it wins at all.

These heuristics are more or less a limited form of the heuristics
that skip scan would need. A *very* limited form. We're still
conservative. Here's how it works, at a high level: if the scan makes
it all the way to the end of the page without having to start a new
primitive index scan, and then finds that "finaltup" itself (which is
usually the page high key) advances the array keys, we speculate: we
move on to the sibling page. It's
just about possible that we'll discover (once on the next page) that
finaltup actually advanced the array keys by so much (in one single
advancement step) that the current/new keys cover key space beyond the
sibling page we just arrived at. The sibling page access will have
been wasted (though I prefer to think of it as a cost of doing
business).
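
Roughly, in pseudocode-flavored C (illustrative helper names, not the
actual v7 functions -- the real logic lives in
_bt_checkkeys/_bt_advance_array_keys):

    /* reached finaltup without needing a new primitive scan mid-page */
    if (advance_array_keys_using(finaltup))     /* hypothetical helper */
    {
        /*
         * finaltup advanced the arrays.  Speculate that the key space
         * matching the new array keys begins on the sibling page, and
         * step to it, rather than descending from the root again.
         */
        step_to_sibling_page();                 /* hypothetical helper */
    }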

I go into a lot of detail on the trade-offs in this area in comments
at the end of the new _bt_checkkeys(), just after it calls
_bt_advance_array_keys(). Hopefully this is reasonably clear. It's
always much easier to understand these things when you've written lots
of test cases, though. So I wouldn't at all be surprised to hear that
my explanation needs more work. I suspect that I'm spending more time
on the topic than it actually warrants, but you have to spend a lot of
time on it for yourself to be able to see why that is.

> > +++ b/src/backend/access/nbtree/nbtsearch.c
> > @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
> > * set flag to true if all required keys are satisfied and false
> > * otherwise.
> > */
> > - (void) _bt_checkkeys(scan, itup, indnatts, dir,
> > - &requiredMatchedByPrecheck, false);
> > + _bt_checkkeys(scan, &pstate, itup, false, false);
> > + requiredMatchedByPrecheck = pstate.continuescan;
> > + pstate.continuescan = true; /* reset */
>
> The comment above the updated section needs to be updated.

Updated.

> > @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
> > * set flag to true if all required keys are satisfied and false
> > * otherwise.
> > */
> > - (void) _bt_checkkeys(scan, itup, indnatts, dir,
> > - &requiredMatchedByPrecheck, false);
> > + _bt_checkkeys(scan, &pstate, itup, false, false);
>
> This 'false' finaltup argument is surely wrong for the rightmost
> page's rightmost tuple, no?

Not in any practical sense. finaltup means "the tuple that you should
use to decide whether to go to the next page or not", and a rightmost
page doesn't have a next page.

There are exactly two ways that the top-level scan (not to be confused
with a primitive scan) can end, at least in v7. They are (see the
sketch after the list):

1. The state machine can exhaust the scan's array keys, ending the
top-level scan.

2. The scan can just run out of pages, without ever running out of
array keys (some array keys can sort higher than any real value from
the index). This is just like how an index scan ends when it lacks any
required scan keys to terminate the scan, and eventually runs out of
pages to scan (think of an index-only scan that performs a full scan
of the index, feeding into a group aggregate).
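
A minimal sketch of the overall shape, with both exits marked
(hypothetical control flow and flag name, not the patch's actual
code):

    for (;;)
    {
        if (!_bt_steppage(scan, dir))
            break;              /* 2: ran out of pages to scan */

        /* _bt_readpage() -> _bt_checkkeys() for each tuple; the state
         * machine may advance the array keys along the way */

        if (so->arraysExhausted)    /* hypothetical flag */
            break;              /* 1: array keys exhausted */
    }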

Note that it wouldn't be okay if the design relied on _bt_checkkeys
advancing and exhausting the array keys -- we really do need both 1
and 2 to deal with various edge cases. For example, there is no way
that we'll ever be able to call _bt_checkkeys with a completely empty
index. It simply doesn't have any tuples at all. In fact, it doesn't
even have any pages (apart from the metapage), so clearly we can't
expect any calls to _bt_readpage (much less _bt_checkkeys).

> > +++ b/src/backend/access/nbtree/nbtutils.c
> > @@ -357,6 +431,46 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
> > + /* We could pfree(elem_values) after, but not worth the cycles */
> > + num_elems = _bt_merge_arrays(scan, cur,
> > + (indoption[cur->sk_attno - 1] & INDOPTION_DESC) != 0,
> > + prev->elem_values, prev->num_elems,
> > + elem_values, num_elems);
>
> This code can get hit several times when there are multiple = ANY
> clauses, which may result in repeated leakage of these arrays during
> this scan. I think cleaning up may well be worth the cycles when the
> total size of the arrays is large enough.

They won't leak because the memory is allocated in the same dedicated
memory context.

That said, I added a pfree(). It couldn't hurt.
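
For anyone following along, the pattern that makes a leak harmless
here looks roughly like this (a sketch of the existing arrangement in
_bt_preprocess_array_keys, not new v7 code):

    /*
     * Array preprocessing allocations go into so->arrayContext, which
     * nbtree resets/deletes at rescan/endscan -- even a "leaked" array
     * cannot outlive the scan
     */
    MemoryContext oldContext;

    oldContext = MemoryContextSwitchTo(so->arrayContext);
    elem_values = palloc(num_elems * sizeof(Datum));
    /* ... sort, de-duplicate, and merge the arrays ... */
    MemoryContextSwitchTo(oldContext);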

> > @@ -496,6 +627,48 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
> > _bt_compare_array_elements, &cxt);
> > +_bt_merge_arrays(IndexScanDesc scan, ScanKey skey, bool reverse,
> > + Datum *elems_orig, int nelems_orig,
> > + Datum *elems_next, int nelems_next)
> > [...]
> > + /*
> > + * Incrementally copy the original array into a temp buffer, skipping over
> > + * any items that are missing from the "next" array
> > + */
>
> Given that we only keep the members that both arrays have in common,
> the result array will be a strict subset of the original array. So, I
> don't quite see why we need the temporary buffer here - we can reuse
> the entries of the elems_orig array that we've already compared
> against the elems_next array.

This code path is only hit when the query was written on autopilot,
since it must have contained redundant SAOPs for the same index column
-- a glaring inconsistency. Plus these arrays just aren't very big in
practice (despite my concerns about huge arrays). Plus there is only
one of these array-specific preprocessing steps per btrescan. So I
don't think that it's worth going to too much trouble here.

> We may want to optimize this further by iterating over only the
> smallest array: With the current code, [1, 2] + [1....1000] is faster
> to merge than [1..1000] + [1000, 1001], because 2 * log(1000) is much
> smaller than 1000*log(2). In practice this may matter very little,
> though.
> An even better optimized version would do a merge join on the two
> arrays, rather than loop + binary search.

v7 allocates the temp buffer using the size of whichever array is
smaller, just because it's an easy marginal improvement.

> > @@ -515,6 +688,161 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
> > [...]
> > +_bt_binsrch_array_skey(FmgrInfo *orderproc,
>
> Is there a reason for this complex initialization of high/low_elem,
> rather than the this easier to understand and more compact
> initialization?:
>
> + low_elem = 0;
> + high_elem = array->num_elems - 1;
> + if (cur_elem_start)
> + {
> + if (ScanDirectionIsForward(dir))
> + low_elem = array->cur_elem;
> + else
> + high_elem = array->cur_elem;
> + }

I agree that it's better your way. Done that way in v7.
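
In context, that initialization slots into the usual lower-bound
binary search shape. A rough sketch (not the actual v7
_bt_binsrch_array_skey, which also has to report whether an exact
match was found; surrounding declarations assumed):

    low_elem = 0;
    high_elem = array->num_elems - 1;
    if (cur_elem_start)
    {
        if (ScanDirectionIsForward(dir))
            low_elem = array->cur_elem;
        else
            high_elem = array->cur_elem;
    }

    while (low_elem < high_elem)
    {
        int     mid_elem = low_elem + (high_elem - low_elem) / 2;
        int32   result;

        result = DatumGetInt32(FunctionCall2Coll(orderproc,
                                                 skey->sk_collation,
                                                 tupdatum,
                                                 array->elem_values[mid_elem]));
        if (reverse)
            result = -result;

        if (result > 0)
            low_elem = mid_elem + 1;    /* tupdatum sorts after mid */
        else
            high_elem = mid_elem;
    }

    return low_elem;    /* first element not before tupdatum */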

> I think the conditional _bt_parallel_done(scan) feels misplaced here,
> as the comment immediately below indicates the scan is to be
> terminated after that comment. So, either move this _bt_parallel_done
> call outside the function (which by name would imply it is read-only,
> without side effects like this) or move it below the comment
> "terminate the top-level scan".

v7 moves the comment up, so that it's just before the _bt_parallel_done() call.

> > +_bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
> > [...]
> > + * Set up ORDER 3-way comparison function and array state
> > [...]
> > + * Optimization: Skip over non-required scan keys when we know that
>
> These two sections should probably be swapped, as the skip makes the
> setup useless.

Not quite: we need to increment arrayidx for later loop iterations/scan keys.

> Also, the comment here is wrong; the scan keys that are skipped are
> 'required', not 'non-required'.

Agreed. Fixed.

> > +++ b/src/test/regress/expected/join.out
> > @@ -8620,10 +8620,9 @@ where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1 and j2.id1 >= any (array[1,5]);
> > Merge Cond: (j1.id1 = j2.id1)
> > Join Filter: (j2.id2 = j1.id2)
> > -> Index Scan using j1_id1_idx on j1
> > - -> Index Only Scan using j2_pkey on j2
> > + -> Index Scan using j2_id1_idx on j2
> > Index Cond: (id1 >= ANY ('{1,5}'::integer[]))
> > - Filter: ((id1 % 1000) = 1)
> > -(7 rows)
> > +(6 rows)
>
> I'm a bit surprised that we don't have the `id1 % 1000 = 1` filter
> anymore. The output otherwise matches (quite possibly because the
> other join conditions don't match) and I don't have time to
> investigate the intricacies between IOS vs normal IS, but this feels
> off.

This happens because the new plan uses a completely different index --
which happens to be a partial index whose predicate exactly matches
the old plan's filter quals. That factor makes the filter quals
unnecessary. That's all this is.

> As for the planner changes, I don't think I'm familiar enough with the
> planner to make any authoritative comments on this. However, it does
> look like you've changed the meaning of 'amsearcharray', and I'm not
> sure it's OK to assume all indexes that support amsearcharray will
> also support for this new assumption of ordered retrieval of SAOPs.
> For one, the pgroonga extension [0] does mark
> amcanorder+amsearcharray.

The changes that I've made to the planner are subtractive. We more or
less go back to how things were just after the initial work on nbtree
amsearcharray support. That work was (at least tacitly) assumed to
have no impact on ordered scans. Because why should it? What other
type of index clause has ever affected what seems like a rather
unrelated thing (namely the sort order of the scan)? The oversight was
understandable. The kinds of plans that master cannot produce output
for in standard index order are really silly plans, independent of
this issue; it makes zero sense to allow a non-required array scan key
to affect how or when we skip.

The code that I'm removing from the planner is code that quite
obviously assumes nbtree-like behavior. So I'm taking away code like
that, rather than adding new code like that. That said, I am really
surprised that any extension creates an index AM with amcanorder=true
(not to be confused with amcanorderbyop=true, which is less
surprising).
That means that it promises the planner that it behaves just like
nbtree. To quote the docs, it must have "btree-compatible strategy
numbers for their [its] equality and ordering operators". Is that
really something that pgroonga even attempts? And if so, why?

I also find it bizarre that pgroonga's handler-stated capabilities
include "amcanunique=true". So pgroonga is a full text search engine,
but also supports unique indexes? I find that particularly hard to
believe, and suspect that the way that they set things up in the AM
handler just isn't very well thought out.
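
(For context, these capabilities are just boolean fields that an AM's
handler function fills into its IndexAmRoutine -- a hypothetical
handler fragment, not pgroonga's actual code:)

    IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);

    amroutine->amcanorder = true;       /* promises btree-compatible
                                         * strategy numbers for equality
                                         * and ordering operators */
    amroutine->amcanorderbyop = false;
    amroutine->amsearcharray = true;
    amroutine->amcanunique = true;      /* claims unique index support */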

--
Peter Geoghegan

Attachment: v7-0001-Enhance-nbtree-ScalarArrayOp-execution.patch
(application/octet-stream, 126.7 KB)
