Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, benoit <benoit(at)hopsandfork(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Date: 2023-11-10 02:05:39
Message-ID: CAEze2WjF_URGQ0LcroVQ5YpsTfvMKMMjs_00p0eqmUcpj8vd-g@mail.gmail.com
Lists: pgsql-hackers

On Fri, 10 Nov 2023 at 00:58, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Tue, Nov 7, 2023 at 5:53 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > If you end up finding a bug in this v6, it'll most likely be a case
> > where nbtree fails to live up to that. This project is as much about
> > robust/predictable performance as anything else -- nbtree needs to be
> > able to cope with practically anything. I suggest that your review
> > start by trying to break the patch along these lines.
>
> I spent some time on this myself today (which I'd already planned on).
>
> Attached is an adversarial stress-test, which shows something that
> must be approaching the worst case for the patch in terms of time
> spent with a buffer lock held, due to spending so much time evaluating
> unusually expensive SAOP index quals. The array binary searches that
> take place with a buffer lock held aren't quite like anything else
> that nbtree can do right now, so it's worthy of special attention.
>
> I thought of several factors that maximize both the number of binary
> searches within any given _bt_readpage, as well as the cost of each
> binary search -- the SQL file has full details. My test query is
> *extremely* unrealistic, since it combines multiple independent
> unrealistic factors, all of which aim to make life hard for the
> implementation in one way or another. I hesitate to say that it
> couldn't be much worse (I only spent a few hours on this), but I'm
> prepared to say that it seems very unlikely that any real world query
> could make the patch spend as many cycles in
> _bt_readpage/_bt_checkkeys as this one does.
>
> Perhaps you can think of some other factor that would make this test
> case even less sympathetic towards the patch, Matthias? The only thing
> I thought of that I've left out was the use of a custom btree opclass,
> "unrealistically_slow_ops". Something that calls pg_usleep in its
> order proc. (I left it out because it wouldn't prove anything.)

Have you tried using text index columns that are sorted with
non-default locales?
I've seen non-default locales use significantly more resources during
compare operations than any other ordering operation I know of (with
most of that cost being in looking up the locale), and I use them
extensively to test improvements for worst-case index shapes in my
btree patchsets, because locales are loaded dynamically during text
comparison and non-default locales are not cached at all. I suspect
this would get even worse if an even slower locale path exists than
the one I'm using for testing right now; that could be the case with
complex custom ICU locales.
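
To illustrate what I mean by a complex custom ICU locale (the
collation name and locale string here are made up purely for
illustration), something along these lines forces every comparison
through a non-default, keyword-laden ICU path:

CREATE COLLATION expensive_fr (
    provider = icu,
    -- French with numeric ordering and tertiary strength enabled
    locale = 'fr-FR-u-kn-true-ks-level3'
);
CREATE TABLE coll_test (val text COLLATE expensive_fr);
CREATE INDEX ON coll_test (val);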

> On my machine, custom instrumentation shows that each call to
> _bt_readpage made while this query executes (on a patched server)
> takes just under 1.4 milliseconds. While that is far longer than it
> usually takes, it's basically acceptable IMV. It's not significantly
> longer than I'd expect heap_index_delete_tuples() to take on an
> average day with EBS (or other network-attached storage). But that's a
> process that happens all the time, with an exclusive buffer lock held
> on the leaf page throughout -- whereas this is only a shared buffer
> lock, and involves a query that's just absurd.
>
> Another factor that makes this seem acceptable is just how sensitive
> the test case is to everything going exactly and perfectly wrong, all
> at the same time, again and again. The test case uses a 32 column
> index (the INDEX_MAX_KEYS maximum), with a query that has 32 SAOP
> clauses (one per index column). If I reduce the number of SAOP clauses
> in the query to (say) 8, I still have a test case that's almost as
> silly as my original -- but now we only spend ~225 microseconds in
> each _bt_readpage call (i.e. we spend over 6x less time in each
> _bt_readpage call). (Admittedly if I also make the CREATE INDEX use
> only 8 columns, we can fit more index tuples on one page, leaving us
> at ~800 microseconds).

A quick update of the table definition, using text hash columns
instead of numeric and giving each column a different one of the
installed 'fr-%-x-icu' collations, gets EXPLAIN (ANALYZE) to show
2.07ms spent per buffer hit inside the index scan node, as opposed to
1.76ms when using numeric. But, as you mention, the value of this
metric is probably not very high.
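
A sketch of the kind of change I mean (not Peter's attached SQL file;
the md5() text values and these specific collations are just
illustrative, and the real index of course extends to all 32 columns):

CREATE TABLE saop_locale_stress (
    c1 text COLLATE "fr-BE-x-icu",
    c2 text COLLATE "fr-CA-x-icu",
    c3 text COLLATE "fr-CH-x-icu",
    c4 text COLLATE "fr-FR-x-icu"
    -- ...continued up to the 32-column INDEX_MAX_KEYS limit
);
INSERT INTO saop_locale_stress
SELECT md5(i::text), md5((i + 1)::text),
       md5((i + 2)::text), md5((i + 3)::text)
FROM generate_series(1, 10000) i;
CREATE INDEX ON saop_locale_stress (c1, c2, c3, c4);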

As for the patch itself, I'm probably about 50% through it now. While
reviewing, I noticed the following two user-visible items, related to
SAOP but not broken by or touched upon in this patch:

1. We don't seem to plan `column opr ALL (...)` as an index condition,
even though this should be trivial to optimize for at least btree. Example:

SET enable_bitmapscan = OFF;
WITH a AS (select generate_series(1, 1000) a)
SELECT * FROM tenk1
WHERE thousand = ANY (array(table a))
AND thousand < ALL (array(table a));

This will never return any rows, but it does hit 9990 buffers in the
new btree code. Based on the query and index I had expected it to hit
0 buffers, until I realized that we don't push ALL into index filters.
I shall assume ALL isn't used all that often (heh), but it sure feels
like we're missing out on performance here.
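
The buffer counts can be observed by running the query above under
EXPLAIN (ANALYZE, BUFFERS):

EXPLAIN (ANALYZE, BUFFERS)
WITH a AS (select generate_series(1, 1000) a)
SELECT * FROM tenk1
WHERE thousand = ANY (array(table a))
AND thousand < ALL (array(table a));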

2. We also don't seem to support array keys for row compares, which
probably is an even more niche use case:

SELECT count(*)
FROM tenk1
WHERE (thousand, tenthous) = ANY (ARRAY[(1, 1), (1, 2), (2, 1)]);

This is no different from master, but it'd be nice if there were
support for arrays of row operations too, so that composite primary
keys can also be looked up with SAOPs.
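
For comparison, the way such composite-key lookups are usually spelled
today is with a row-constructor IN list, which IIUC the planner
expands into OR'ed row comparisons rather than a single array scan key:

SELECT count(*)
FROM tenk1
WHERE (thousand, tenthous) IN ((1, 1), (1, 2), (2, 1));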

Kind regards,

Matthias van de Meent
