Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, benoit <benoit(at)hopsandfork(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Date: 2023-11-08 01:53:01
Message-ID: CAH2-Wz=Ct9fLcN-2SV0gx_iBk87RsAi+Sor3OwR2_5Rg_KO0+g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 7, 2023 at 4:20 AM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> On Tue, 7 Nov 2023 at 00:03, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > I should be able to post v6 later this week. My current plan is to
> > commit the other nbtree patch first (the backwards scan "boundary
> > cases" one from the ongoing CF) -- since I saw your review earlier
> > today. I think that you should probably wait for this v6 before
> > starting your review.
>
> Okay, thanks for the update, then I'll wait for v6 to be posted.

On second thought, I'll just post v6 now (there won't be conflicts
against the master branch once the other patch is committed anyway).

Highlights:

* Major simplifications to the array key state machine, already
described by my recent email.

* Added preprocessing of "redundant and contradictory" array elements
to _bt_preprocess_array_keys().

This allows the special preprocessing pass just for array keys
("preprocessing preprocessing") within _bt_preprocess_array_keys() to
reduce this query to a no-op:

select * from tab where a in (180, 345) and a in (230, 300); -- contradictory

Similarly, it can make this query attempt only a single primitive
index scan, for "230":

select * from tab where a in (180, 230) and a in (230, 300); -- has
redundancies, plus some individual elements contradict each other
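Conceptually, this pass intersects the sorted element lists of equality
SAOPs on the same column. A minimal Python sketch of the idea (my own
illustration, not the actual C implementation; the function name is
made up):

```python
def preprocess_equality_arrays(arrays):
    """Merge multiple IN-lists on the same column by intersection.

    Returns the sorted, deduplicated elements that every array agrees
    on. An empty result means the quals are contradictory, so the
    whole scan can be reduced to a no-op.
    """
    merged = set(arrays[0])
    for arr in arrays[1:]:
        merged &= set(arr)
    return sorted(merged)

# "a in (180, 345) and a in (230, 300)" -- contradictory
print(preprocess_equality_arrays([[180, 345], [230, 300]]))  # []

# "a in (180, 230) and a in (230, 300)" -- one primitive scan for 230
print(preprocess_equality_arrays([[180, 230], [230, 300]]))  # [230]
```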

This duplicates some of what _bt_preprocess_keys can do already. But
_bt_preprocess_keys can only do this stuff at the level of individual
array elements/primitive index scans. Whereas this works "one level
up", allowing preprocessing to see the full picture rather than just
seeing the start of one particular primitive index scan. It explicitly
works across array keys, saving repeat work inside
_bt_preprocess_keys. That could really add up with thousands of array
keys and/or multiple SAOPs. (Note that _bt_preprocess_array_keys
already does something like this, to deal with SAOP inequalities such
as "WHERE my_col >= any (array[1, 2])" -- it's a little surprising
that this obvious optimization wasn't part of the original nbtree SAOP
patch.)
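That SAOP-inequality optimization amounts to collapsing the array down
to a single scalar bound. Roughly, as a sketch (illustrative Python
only, not the C code; the function name is hypothetical):

```python
def reduce_saop_inequality(op, elements):
    """Collapse a "col >= ANY(array)" style qual to one scalar bound.

    ">= ANY" / "> ANY" is satisfied iff the column satisfies the
    comparison against the smallest element; "<= ANY" / "< ANY" iff it
    does so against the largest.
    """
    if op in (">=", ">"):
        return op, min(elements)
    if op in ("<=", "<"):
        return op, max(elements)
    raise ValueError("not an inequality SAOP operator")

# "WHERE my_col >= any (array[1, 2])" becomes "my_col >= 1"
print(reduce_saop_inequality(">=", [1, 2]))  # ('>=', 1)
```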

This reminds me: you might want to try breaking the patch by coming up
with adversarial cases, Matthias. The patch needs to be able to deal
with absurdly large amounts of array keys reasonably well, because it
proposes to normalize passing those to the nbtree code. It's
especially important that the patch never takes too much time to do
something (e.g., binary searching through array keys) while holding a
buffer lock -- even with very silly adversarial queries.
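To get a feel for why even huge arrays need not be a problem while a
buffer lock is held: advancing to the next matching array key can be a
binary search over the sorted elements, so each lookup is O(log n)
comparisons. A toy illustration (not the nbtree code; names are made
up):

```python
import bisect

def next_array_key(sorted_keys, tuple_value):
    """Find the first array element >= the current index tuple's value.

    With 500,000 elements this is only ~19 comparisons per lookup,
    which is why time spent holding a buffer lock can stay tiny even
    for absurdly large arrays. Returns None when the tuple value is
    past every element.
    """
    i = bisect.bisect_left(sorted_keys, tuple_value)
    return sorted_keys[i] if i < len(sorted_keys) else None

keys = list(range(0, 500001))  # like generate_series(0, 500000)
print(next_array_key(keys, 123456))  # 123456
print(next_array_key(keys, 500001))  # None
```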

So, for example, queries like this one (specifically designed to
stress the implementation) *need* to work reasonably well:

with a as (
select i from generate_series(0, 500000) i
)
select
count(*), thousand, tenthous
from
tenk1
where
thousand = any (array[(select array_agg(i) from a)]) and
tenthous = any (array[(select array_agg(i) from a)])
group by
thousand, tenthous
order by
thousand, tenthous;

(You can run this yourself after the regression tests finish, of course.)

This takes about 130ms on my machine, hardly any of which takes place
in the nbtree code with the patch (think tens of microseconds per
_bt_readpage call, at most) -- the plan is an index-only scan that
gets only 30 buffer hits. On the master branch, it's vastly slower --
1000025 buffer hits. The query as a whole takes about 3 seconds there.

If you have 3 or 4 SAOPs (with a composite index that has as many
columns) you can quite easily DOS the master branch, since the planner
makes a generic assumption that each of these SAOPs will have only 10
elements. The planner still makes that assumption with the patch
applied, but with one important difference: it doesn't matter to
nbtree. The cost of
scanning each index page should be practically independent of the
total size of each array, at least past a certain point. Similarly,
the maximum cost of an index scan should be approximately fixed: it
should be capped at the cost of a full index scan (with the added cost
of these relatively expensive quals still capped, still essentially
independent of array sizes past some point).
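One way to read that claim as a toy cost model: the primitive-scan
component saturates at the cost of one full index scan, while only the
cheap per-element qual evaluation keeps scaling. This is purely my own
illustration of the shape of the bound, not the actual costing code:

```python
def capped_index_scan_cost(num_elems, per_primscan_cost,
                           full_scan_cost, per_elem_qual_cost):
    """Toy model: the page-access component of an SAOP index scan is
    capped at a full index scan, no matter how many array elements
    there are; only (relatively cheap) qual evaluation scales with
    the array size."""
    access_cost = min(num_elems * per_primscan_cost, full_scan_cost)
    return access_cost + num_elems * per_elem_qual_cost

# Past a certain array size, the access component stops growing:
print(capped_index_scan_cost(5, 1.0, 100.0, 0.0))        # 5.0
print(capped_index_scan_cost(1_000_000, 1.0, 100.0, 0.0))  # 100.0
```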

I notice that if I remove the "thousand = any (array[(select
array_agg(i) from a)]) and" line from the adversarial query, executing
the resulting query still gets 30 buffer hits with the patch -- though
it only takes 90ms this time (it's faster for reasons that likely have
less to do with nbtree overheads than you'd think). This is just
another way of getting roughly the same full index scan. That's a
completely new way of thinking about nbtree SAOPs from a planner
perspective (also from a user's perspective, I suppose).

It's important that the planner's new optimistic assumptions about the
cost profile of SAOPs (that it can expect reasonable
performance/access patterns with wildly unreasonable/huge/arbitrarily
complicated SAOPs) always be met by nbtree -- no repeat index page
accesses, no holding a buffer lock for more than (say) a small
fraction of 1 millisecond (no matter the complexity of the query), and
possibly other things I haven't thought of yet.

If you end up finding a bug in this v6, it'll most likely be a case
where nbtree fails to live up to that. This project is as much about
robust/predictable performance as anything else -- nbtree needs to be
able to cope with practically anything. I suggest that your review
start by trying to break the patch along these lines.

--
Peter Geoghegan

Attachment Content-Type Size
v6-0001-Enhance-nbtree-ScalarArrayOp-execution.patch application/octet-stream 115.3 KB
