Re: POC, WIP: OR-clause support for indexes

From: Alena Rybakina <lena(dot)ribackina(at)yandex(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, "Finnerty, Jim" <jfinnert(at)amazon(dot)com>
Cc: Marcos Pegoraro <marcos(at)f10(dot)com(dot)br>, Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>, Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org, teodor(at)sigaev(dot)ru, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Subject: Re: POC, WIP: OR-clause support for indexes
Date: 2023-08-02 15:58:37
Message-ID: b301dce1-09fd-72b1-834a-527ca428db5e@yandex.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I fixed an error that caused the current optimization not to work with
prepared queries. I added a test to catch similar cases in the future.
I have attached a patch.

On 01.08.2023 22:42, Peter Geoghegan wrote:
> On Mon, Jul 31, 2023 at 9:38 AM Alena Rybakina <lena(dot)ribackina(at)yandex(dot)ru> wrote:
>> I noticed only one thing there: when we have unsorted array values in
>> SOAP, the query takes longer than
>> when it has a sorted array. I'll double-check it just in case and write
>> about the results later.
> I would expect the B-Tree preprocessing by _bt_preprocess_array_keys()
> to be very slightly faster when the query is written with presorted,
> duplicate-free constants. Sorting is faster when you don't really have
> to sort. However, I would not expect the effect to be significant
> enough to matter, except perhaps in very extreme cases.
> Although...some of the cases you care about are very extreme cases.
I tested an optimization to compare execution time and scheduling with
sorting, shuffling, and reverse sorting constants in the simple case and
I didn't notice any significant changes (compare_sorted.png).
(I used a database with 100 million values generated by pgbench).
>> I am also testing some experience with multi-column indexes using SAOPs.
> Have you thought about a similar transformation for when the row
> constructor syntax happens to have been used?
>
> Consider a query like the following, against a table with a composite
> index on (a, b):
>
> select * from multi_test where ( a, b ) in (( 1, 1 ), ( 2, 1 ));
>
> This query will get a BitmapOr based plan that's similar to the plans
> that OR-based queries affected by your transformation patch get today,
> on HEAD. However, this equivalent spelling has the potential to be
> significantly faster:
>
> select * from multi_test where a = any('{1,2}') and b = 1;
>
> (Of course, this is more likely to be true with my nbtree SAOP patch in place.)
No, I haven't thought about it yet. I studied the example and it would
really be nice to add optimization here. I didn't notice any problems
with its implementation. I also have an obvious example with the "or"
operator, for example
, select * from multi_test, where (a, b ) = ( 1, 1 ) or (a, b ) = ( 2, 1
) ...;

Although I think such a case will be used less often.

Thank you for the example, I think I understand better why our patches
help each other, but I will review your patch again.

I tried another example to see the lack of optimization in the pgbench
database, but I also created an additional index:

create index ind1 on pgbench_accounts(aid,bid);

test_db=# explain analyze select * from pgbench_accounts where (aid,
bid) in ((2,1), (2,2), (2,3), (3,3));
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on pgbench_accounts  (cost=17.73..33.66 rows=1
width=97) (actual time=0.125..0.133 rows=1 loops=1)
   Recheck Cond: ((aid = 2) OR (aid = 2) OR (aid = 2) OR (aid = 3))
   Filter: (((aid = 2) AND (bid = 1)) OR ((aid = 2) AND (bid = 2)) OR
((aid = 2) AND (bid = 3)) OR ((aid = 3) AND (bid = 3)))
   Rows Removed by Filter: 1
   Heap Blocks: exact=1
   ->  BitmapOr  (cost=17.73..17.73 rows=4 width=0) (actual
time=0.100..0.102 rows=0 loops=1)
         ->  Bitmap Index Scan on pgbench_accounts_pkey
(cost=0.00..4.43 rows=1 width=0) (actual time=0.036..0.037 rows=1 loops=1)
               Index Cond: (aid = 2)
         ->  Bitmap Index Scan on pgbench_accounts_pkey
(cost=0.00..4.43 rows=1 width=0) (actual time=0.021..0.022 rows=1 loops=1)
               Index Cond: (aid = 2)
         ->  Bitmap Index Scan on pgbench_accounts_pkey
(cost=0.00..4.43 rows=1 width=0) (actual time=0.021..0.021 rows=1 loops=1)
               Index Cond: (aid = 2)
         ->  Bitmap Index Scan on pgbench_accounts_pkey
(cost=0.00..4.43 rows=1 width=0) (actual time=0.019..0.020 rows=1 loops=1)
               Index Cond: (aid = 3)
 Planning Time: 0.625 ms
 Execution Time: 0.227 ms
(16 rows)

I think such optimization would be useful here: aid =2 and bid in (1,2)
or (aid,bid)=((3,3))

> Note that we currently won't use RowCompareExpr in many simple cases
> where the row constructor syntax has been used. For example, a query
> like this:
>
> select * from multi_test where ( a, b ) = (( 2, 1 ));
>
> This case already involves a transformation that is roughly comparable
> to the one you're working on now. We'll remove the RowCompareExpr
> during parsing. It'll be as if my example row constructor equality
> query was written this way instead:
>
> select * from multi_test where a = 2 and b = 1;
>
> This can be surprisingly important, when combined with other things,
> in more realistic examples.
>
> The nbtree code has special knowledge of RowCompareExpr that makes the
> rules for comparing index tuples different to those from other kinds
> of index scans. However, due to the RowCompareExpr transformation
> process I just described, we don't need to rely on that specialized
> nbtree code when the row constructor syntax is used with a simple
> equality clause -- which is what makes the normalization process have
> real value. If the nbtree code cannot see RowCompareExpr index quals
> then it cannot have this problem in the first place. In general it is
> useful to "normalize to conjunctive normal form" when it might allow
> scan key preprocessing in the nbtree code to come up with a much
> faster approach to executing the scan.
>
> It's easier to understand what I mean by showing a simple example. The
> nbtree preprocessing code is smart enough to recognize that the
> following query doesn't really need to do any work, due to having
> quals that it recognizes as contradictory (it can set so->qual_okay to
> false for unsatisfiable quals):
>
> select * from multi_test where ( a, b ) = (( 2, 1 )) and a = -1;
>
> However, it is not smart enough to perform the same trick if we change
> one small detail with the query:
>
> select * from multi_test where ( a, b ) >= (( 2, 1 )) and a = -1;

Yes, I have run the examples and I see it.

((ROW(aid, bid) >= ROW(2, 1)) AND (aid = '-1'::integer))

As I see it, we can implement such a transformation:

'( a, b ) >= (( 2, 1 )) and a = -1'     ->    'aid >= 2 and bid >= 1 and
aid =-1'

It seems to me the most difficult thing is to notice problematic cases
where the transformations are incorrect, but I think it can be implemented.

> Ideally, the optimizer would canonicalize/normalize everything in a
> way that made all of the nbtree preprocessing optimizations work just
> as well, without introducing any new special cases. Obviously, there
> is no reason why we can't perform the same trick with the second
> variant. (Note also that the nbtree preprocessing code can be smart
> about redundant quals, not just contradictory quals, so it matters
> more than it may appear from this simple, unrealistic example of
> mine.)
I agree with your position, but I still don't understand how to consider
transformations to generalized cases without relying on special cases.

As I understand it, you assume that it is possible to apply
transformations at the index creation stage, but there I came across the
selectivity overestimation problem.

I still haven't found a solution for this problem.

> While these similar RowCompareExpr transformations are at least
> somewhat important, that's not really why I bring them up now. I am
> pointing them out now because I think that it might help you to
> develop a more complete mental model of these transformations.
> Ideally, your initial approach will generalize to other situations
> later on. So it's worth considering the relationship between this
> existing RowCompareExpr transformation, and the one that you're
> working on currently. Plus other, future transformations.
I will consider my case more broadly, but for this I will need some
research work.
> This example might also give you some appreciation of why my SAOP
> patch is confused about when we need to do normalization/safety
> checks. Some things seem necessary when generating index paths in the
> optimizer. Other things seem necessary during preprocessing, in the
> nbtree code, at the start of the index scan. Unfortunately, it's not
> obvious to me where the right place is to deal with each aspect of
> setting up multi-column SAOP index quals. My mental model is very
> incomplete.
To be honest, I think that in your examples I understand better what you
mean by normalization to the conjunctive norm, because I only had a
theoretical idea from the logic course.

Hence, yes, normalization/security checks - now I understand why they
are necessary.

--
Regards,
Alena Rybakina
Postgres Professional

Attachment Content-Type Size
v7-Replace-OR-clause-to-ANY-expressions.patch text/x-patch 34.2 KB
compare_sorted.png image/png 78.5 KB
diff.diff text/x-patch 3.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2023-08-02 15:59:15 Re: add timing information to pg_upgrade
Previous Message jian he 2023-08-02 15:46:41 Re: ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression