Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, benoit <benoit(at)hopsandfork(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Date: 2024-03-22 00:32:42
Message-ID: CAH2-WznZ5ebfv96DqpKArBBcdnCUxkMOhFGipRA5mdhPKjVHsg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 20, 2024 at 3:26 PM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> OK, here's a small additional review, with a suggestion for additional
> changes to _bt_preprocess:

Attached is v16. This has some of the changes that you asked for. But
my main focus has been on fixing regressions that were discovered
during recent performance validation.

Up until now, my stress testing of the patch (not quite the same thing
as benchmarking) has more or less consisted of using pgbench to look
at a few different types of cases. These are:

1. Extreme cases where the patch obviously helps the most, at least
insofar as navigating the index structure itself is concerned (note
that avoiding heap accesses isn't in scope for the pgbench stress
testing).

These are all variants of pgbench's SELECT workload, with a huge IN()
list containing contiguous integers (we can get maybe a 5.3x increase
in TPS here), or integers that are spaced apart by maybe 20 - 50
tuples (where we can get maybe a 20% - 30% improvement in TPS).

2. The opposite extreme, where an IN() list has integers that are
essentially random, and so are spaced very far apart on average.

Another variant of pgbench's SELECT workload, with an IN() list. This
is a case that the patch has hardly any chance of helping, so the goal
here is to just avoid regressions. Since I've been paying attention to
this for a long time, this wasn't really a problem for v15.

3. Cases where matching tuples are spaced apart by 100 - 300
tuples/integer values.

These cases are "somewhere between 1 and 2". Cases where I would
expect to win by a smaller amount (say a 5% improvement in TPS), but
v15 nevertheless lost by as much as 12% here. This was a problem that
seemed to demand a fix.

The underlying reason why all earlier versions of the patch had these
regressions came down to their use of a pure linear search (by which I
mean having _bt_readpage call _bt_checkkeys again and again on the
same page). That was just too slow here. v15 could roughly halve the
number of index descents compared to master, as expected -- but that
wasn't enough to make up for the overhead of having to plow through
hundreds of non-matching tuples on every leaf page. v15 couldn't even
break even.

v16 of the patch fixes the problem by adding a limited additional form
of "skipping", used *within* a page, as it is scanned by _bt_readpage.
(As opposed to skipping over the index by redescending anew, starting
from the root page.)

In other words, v16 teaches the linear search process to occasionally
"look ahead" when it seems like the linear search has required too
many comparisons at that point (comparisons by
_bt_tuple_before_array_skeys made from within _bt_checkkeys). You can
think of this new "look ahead" mechanism as bridging the gap between
case 1 and case 2 (fixing case 3, a hybrid of 1 and 2). We still
mostly want to use a "linear search" from within _bt_readpage, but we
do benefit from having this fallback when that just isn't working very
well. Now even these new type 3 stress tests see an increase in TPS of
perhaps 5% - 12%, depending on just how far apart matching tuples are
spaced (the total size of the IN list is also relevant, but much less
so).

Separately, I added a new optimization to the process that selects the
next array element via a binary search of the array (specifically, to
the function _bt_binsrch_array_skey), which lowers the cost of
advancing required arrays that trigger advancement: we only really
need one comparison per _bt_binsrch_array_skey call in almost all
cases. In practice we'll almost certainly find that the next array
element in line is <= the unsatisfied tuple datum (we already know
from context that the current/former array element must now be < that
same tuple datum, so the conditions under which this doesn't work out
right away are narrow). This second optimization is much more general
than the first one. It helps with pretty much any kind of adversarial
pgbench stress test.
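A minimal sketch of that fast path (names and the int-typed "datum" are
illustrative only, not the patch's code): before doing a full binary
search for the first array element >= the tuple datum, try the very
next element, which usually decides the call with a single comparison:

```c
/*
 * Toy sketch: advance a sorted array past element 'cur' to the first
 * element >= tupdatum.  Caller guarantees elems[cur] < tupdatum, so the
 * next element is very often the answer -- check it first, and fall
 * back to a binary search over the remaining elements only when not.
 */
static int
advance_array(const int *elems, int nelems, int cur, int tupdatum)
{
    /* common case: one comparison settles it */
    if (cur + 1 < nelems && elems[cur + 1] >= tupdatum)
        return cur + 1;

    /* fall back: binary search for first element >= tupdatum */
    int lo = cur + 1,
        hi = nelems;
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (elems[mid] < tupdatum)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;                  /* == nelems when the array is exhausted */
}
```

The fast path never changes the result; it just avoids the O(log n)
search in the case the surrounding text describes as overwhelmingly
common.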

> Here, you insert your code between the comment about which opfamily to
> choose and the code assigning the opfamily. I think this needs some
> cleanup.

Fixed.

> > + * Don't call _bt_preprocess_array_keys_final in this fast path
> > + * (we'll miss out on the single value array transformation, but
> > + * that's not nearly as important when there's only one scan key)
>
> Why is it OK to ignore it? Or, why don't we apply it here?

It doesn't seem worth adding any cycles to that fast path, given that
the array transformation in question can only help us to convert "=
any('{1}')" into "= 1", because there are no other scan keys that
could cut down the number of array constants first (through earlier
array-specific preprocessing steps).

Note that even this can't be said for converting "IN (1)" into "= 1",
since the planner already does that for us, without nbtree ever seeing
it directly.

> Attached 2 patches for further optimization of the _bt_preprocess_keys
> path (on top of your v15), according to the following idea:
>
> Right now, we do "expensive" processing with xform merging for all
> keys when we have more than 1 keys in the scan. However, we only do
> per-attribute merging of these keys, so if there is only one key for
> any attribute, the many cycles spent in that loop are mostly wasted.

Why do you say that? I agree that calling _bt_compare_scankey_args()
more than necessary might matter, but that doesn't come up when
there's only one scan key per attribute. We'll only copy each scan key
into the strategy-specific slot for the attribute's temp xform[]
array.

Actually, it doesn't necessarily come up when there's more than one
scan key per index attribute, depending on the strategies in use. That
seems like an existing problem on HEAD; I think we should be doing
this more. We could stand to do better at preprocessing when
inequalities are mixed together. Right now, we can't detect
contradictory quals, given a case such as "SELECT * FROM
pgbench_accounts WHERE aid > 5 AND aid < 3". It'll only work for
something like "WHERE aid = 5 AND aid < 3", so we'll still uselessly
descend the index once (not a disaster, but not ideal either).
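The kind of cross-inequality check being wished for here can be
sketched in miniature (invented types; the real preprocessing would
compare bounds with the opfamily's support functions, not raw ints):

```c
/*
 * Toy sketch: given the inequality quals on one attribute, detect the
 * contradictory case "attr > lo AND attr < hi" with hi <= lo (e.g.
 * "aid > 5 AND aid < 3"), which no row can satisfy, so the scan could
 * end without ever descending the index.  Deliberately conservative:
 * only flags cases that are empty for any ordered datatype.
 */
typedef struct
{
    int strategy;               /* > 0 means "attr > value", < 0 means "attr < value" */
    int value;
} Qual;

static int
quals_contradictory(const Qual *quals, int nquals)
{
    int have_lo = 0, have_hi = 0;
    int lo = 0, hi = 0;

    for (int i = 0; i < nquals; i++)
    {
        if (quals[i].strategy > 0)      /* lower bound: keep the largest */
        {
            if (!have_lo || quals[i].value > lo)
                lo = quals[i].value;
            have_lo = 1;
        }
        else                            /* upper bound: keep the smallest */
        {
            if (!have_hi || quals[i].value < hi)
                hi = quals[i].value;
            have_hi = 1;
        }
    }
    /* "> lo AND < hi" is unsatisfiable whenever hi <= lo */
    return have_lo && have_hi && hi <= lo;
}
```

Detecting this during preprocessing would let the "aid > 5 AND aid < 3"
case end immediately, the same way the existing "= 5 AND < 3" case
already does.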

Honestly, I don't follow what the goal is here. Does it really help to
restructure the loop like this? Feels like I'm missing something. We
do this when there's only one scan key, sort of, but that's probably
not really that important anyway. Maybe it helps a bit for nestloop
joins with an inner index scan, where one array scan key is common.

--
Peter Geoghegan

Attachment Content-Type Size
v16-0001-Enhance-nbtree-ScalarArrayOp-execution.patch application/octet-stream 188.9 KB
