Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, benoit <benoit(at)hopsandfork(dot)com>
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Date: 2023-09-17 23:47:45
Message-ID: CAH2-WzkEyBU9UQM-5GWPcB=WEShAUKcJdvgFuqVHuPuO-iYW0Q@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 26, 2023 at 6:41 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > MDAM seems to require exponential storage for "scan key operations"
> > for conditions on N columns (to be precise, the product of the number
> > of distinct conditions on each column); e.g. an index on mytable
> > (a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ...
> > AND h IN (1, 2)" would require 2^8 entries.

> What you describe is a problem in theory, but I doubt that it's a
> problem in practice. You don't actually have to materialize the
> predicates up-front, or at all. Plus you can skip over them using the
> next index tuple. So skipping works both ways.

Attached is v2, which makes all array key advancement take place using
the "next index tuple" approach (binary searching the arrays for values
taken from index tuples). This approach was necessary for fairly
mundane reasons (it limits the amount of work required while holding a
buffer lock), but it also solves quite a few other problems that I
find far more interesting.

It's easy to imagine the state machine from v2 of the patch being
extended for skip scan. My approach "abstracts away" the arrays. For
skip scan, it would more or less behave as if the user had written a
query "WHERE a in (<Every possible value for this column>) AND b = 5
... " -- without actually knowing what the so-called array keys for
the high-order skipped column are (not up front, at least). We'd only
need to track the current "array key" for the scan key on the skipped
column, "a". The state machine would notice when the scan had reached
the next-greatest "a" value in the index (whatever that might be), and
then make that the current value. Finally, the state machine would
effectively instruct its caller to consider repositioning the scan via
a new descent of the index. In other words, almost everything for skip
scan would work just like regular SAOPs -- and any differences would
be well encapsulated.
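
To make that concrete, here's a hypothetical sketch (the table, index,
and query are mine, purely for illustration -- this patch doesn't
itself implement skip scan):

CREATE TABLE orders (region text, order_id integer, total numeric);
CREATE INDEX orders_region_id_idx ON orders (region, order_id);

-- No qual on the leading column. A skip scan built on this state
-- machine would behave as if the query had been written as
-- "WHERE region IN (<every distinct region>) AND order_id = 5",
-- repositioning the scan once per distinct "region" value:
SELECT * FROM orders WHERE order_id = 5;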

But it's not just skip scan. This approach also enables thinking of
SAOP index scans (using nbtree) as just another type of indexable
clause, without any special restrictions (compared to true indexable
operators such as "=", say), particularly in the planner. That was
always the general thrust of teaching nbtree about SAOPs. But IMV it's
something that should be embraced totally, and that's just what the
patch proposes to do.

In particular, the patch now:

1. Entirely removes the long-standing restriction on generating
pathkeys for index paths with SAOPs, even when inequalities on a
high-order column are present. You can mix SAOPs together with other
clause types, arbitrarily, and everything still works (and works
efficiently).

For example, the regression test expected output for this query/test
(from bugfix commit 807a40c5) is updated by the patch, as shown here:

explain (costs off)
SELECT thousand, tenthous FROM tenk1
WHERE thousand < 2 AND tenthous IN (1001,3000)
ORDER BY thousand;
-                                       QUERY PLAN
----------------------------------------------------------------------------------------
- Sort
-   Sort Key: thousand
-   ->  Index Scan using tenk1_thous_tenthous on tenk1
-         Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
-(4 rows)
+                                   QUERY PLAN
+--------------------------------------------------------------------------------
+ Index Scan using tenk1_thous_tenthous on tenk1
+   Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(2 rows)

We don't need a sort node anymore -- even though the leading column
here (thousand) uses an inequality, a particularly tricky case. Now
it's an index scan, much like any other, with no particular
restrictions caused by using a SAOP.

2. Adds an nbtree strategy for non-required equality array scan keys,
which is built on the same state machine, with only minor differences
to deal with column values "appearing out of key space order".
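
To illustrate (a made-up sketch of my own, not from the regression
tests): given an index on (a, b), a query such as

SELECT a, b FROM t WHERE a < 10 AND b IN (2, 7) ORDER BY a;

can use the "b" array as a true index qual, even though the scan's
position in the index is dictated by the inequality on "a" alone. The
matching "b" values are therefore encountered in whatever order the
"a" values dictate (say 7, 2, 2, 7, ...), not in array order -- which
is exactly what the non-required strategy has to deal with.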

3. Simplifies the optimizer side of things by consistently avoiding
filter quals (except when truly unavoidable). The optimizer doesn't
even consider alternative index paths with filter quals for
lower-order SAOP columns, because they no longer have any possible
advantage. Filter quals, on the other hand, have huge disadvantages,
as we saw upthread. By always using true index quals, we automatically
avoid any question of excessive heap page accesses incurred just to
eliminate non-matching rows. AFAICT we don't need to make a trade-off
here.
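
The shift is easy to picture in EXPLAIN terms. A hypothetical
before/after sketch (abbreviated plans of my own, not actual test
output), reusing the index on (a, b) from the item 2 example:

-- master, preserving index order by demoting the SAOP to a filter:
Index Scan using t_a_b_idx on t
  Index Cond: (a < 10)
  Filter: (b = ANY ('{2,7}'::integer[]))

-- patch, with true index quals only:
Index Scan using t_a_b_idx on t
  Index Cond: ((a < 10) AND (b = ANY ('{2,7}'::integer[])))

The first plan has to visit the heap just to eliminate non-matching
tuples; the second can eliminate them inside the index AM.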

The first version of the patch added some crufty code to the
optimizer, to account for various restrictions on sort order. This
revised version actually removes existing cruft from the same place
(indxpath.c) instead.

Items 1, 2, and 3 are all closely related. Take the query I've shown
for item 1. Bugfix commit 807a40c5 (which added the test query in
question) dealt with an oversight in the then-recent original nbtree
SAOP patch (commit 9e8da0f7): when nbtree combines two primitive index
scans with an inequality on their leading column, we cannot be sure
that the output will appear in the same order as one big continuous
index scan would return it in. We can only expect to maintain the
illusion that we're doing one continuous index scan when individual
primitive index scans access earlier columns via the equality strategy
-- we need "equality constraints".
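
To spell that out using the item 1 query (assuming I have the tenk1
details right): there is one primitive index scan per "tenthous" array
element, and each one returns its own tuples in "thousand" order, but
their concatenation is not in "thousand" order:

-- primitive scan for tenthous = 1001 returns (thousand, tenthous) = (1, 1001)
-- primitive scan for tenthous = 3000 returns (thousand, tenthous) = (0, 3000)
-- concatenated output: "thousand" goes 1, then 0 -- not sorted

Hence the sort node in the old EXPLAIN output.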

In practice, the optimizer (indxpath.c) is very conservative (more
conservative than it really needs to be) when it comes to trusting the
index scan to output rows in index order, in the presence of SAOPs.
All of that now seems totally unnecessary. Again, I don't see a need
to make a trade-off here.

My observation about this query (and others like it) is: why not
literally perform one continuous index scan instead (not multiple
primitive index scans)? That is strictly better, given all the
specifics here. Once we have a way to do that (which the nbtree
executor work listed under item 2 provides), it becomes safe to assume
that the tuples will be output in index order -- there is no illusion
left to preserve. Who needs an illusion that isn't actually helping
us? We actually do less I/O by using this strategy, for the usual
reasons (we can avoid repeating index page accesses).

A more concrete benefit of the non-required-scankeys stuff can be seen
by running Benoit Tigeot's test case [1] with v2. He had a query like
this:

SELECT * FROM docs
WHERE status IN ('draft', 'sent') AND
sender_reference IN ('Custom/1175', 'Client/362', 'Custom/280')
ORDER BY sent_at DESC NULLS LAST LIMIT 20;

And his test case had an index on "sent_at DESC NULLS LAST,
sender_reference, status". This variant was a weak spot for v1.
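
In DDL terms (the index name is my invention):

CREATE INDEX docs_sent_at_idx ON docs (sent_at DESC NULLS LAST, sender_reference, status);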

v2 of the patch is vastly more efficient here, since we don't have to
go to the heap to eliminate non-matching tuples -- that can happen in
the index AM instead. This can easily be 2x-3x faster on a warm cache,
and can have *hundreds* of times fewer buffer accesses (which Benoit
verified with an early version of this v2). All because we now require
vastly less heap access -- the quals are fairly selective here, and we
have to scan hundreds of leaf pages before the scan can terminate.
Avoiding filter quals is a huge win.

This particular improvement is hard to squarely attribute to any one
of my 3 items. The immediate problem that the query presents us with
on the master branch is the problem of filter quals that require heap
accesses to do visibility checks (a problem that index quals can never
have). That makes it tempting to credit my item 3. But you can't
really have item 3 without also having items 1 and 2. Taken together,
they eliminate all possible downsides from using index quals.

That high level direction (try to have one good choice for the
optimizer) seems important to me. Both for this project, and in
general.

Other changes in v2:

* Improved costing, which takes advantage of the fact that nbtree now
promises not to repeat any leaf page accesses (unless the scan is
restarted or the direction of the scan changes). This makes the worst
case far more predictable, and more closely tied to selectivity
estimation -- you can't scan more pages than the whole index has, just
like with every other sort of index scan.

* Support for parallel index scans.

The existing approach to array keys for parallel index scans has been
adapted to work with individual primitive index scans, not individual
array keys. I haven't tested this very thoroughly just yet, but it
seems to work well enough already. I think that it's important to keep
the variation between parallel and serial index scans to a minimum,
which I seem to have mostly managed.

[1] https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d?permalink_comment_id=4690491#gistcomment-4690491
--
Peter Geoghegan

Attachment: v2-0001-Enhance-nbtree-ScalarArrayOp-execution.patch
(application/octet-stream, 96.1 KB)
