Re: POC, WIP: OR-clause support for indexes

From: "a(dot)rybakina" <a(dot)rybakina(at)postgrespro(dot)ru>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, "Finnerty, Jim" <jfinnert(at)amazon(dot)com>, Marcos Pegoraro <marcos(at)f10(dot)com(dot)br>, Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, teodor(at)sigaev(dot)ru, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Peter Eisentraut <peter(at)eisentraut(dot)org>
Subject: Re: POC, WIP: OR-clause support for indexes
Date: 2023-10-15 23:21:39
Message-ID: c27333ea-3515-487e-8025-468e7d86ddff@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi! Thank you for your review!

On 15.10.2023 01:34, Alexander Korotkov wrote:
> Hi, Alena!
>
> Thank you for your work on the subject.
>
> On Wed, Oct 4, 2023 at 10:21 PM a.rybakina<a(dot)rybakina(at)postgrespro(dot)ru> wrote:
>> I fixed the kernel dump issue and all the regression tests were successful, but I discovered another problem when I added my own regression tests.
>> Some queries that contain "or" expressions do not convert to "ANY". I have described this in more detail using diff as expected and real results:
>>
>> diff -U3 /home/alena/postgrespro__copy6/src/test/regress/expected/create_index.out /home/alena/postgrespro__copy6/src/test/regress/results/create_index.out
>> --- /home/alena/postgrespro__copy6/src/test/regress/expected/create_index.out 2023-10-04 21:54:12.496282667 +0300
>> +++ /home/alena/postgrespro__copy6/src/test/regress/results/create_index.out 2023-10-04 21:55:41.665422459 +0300
>> @@ -1925,17 +1925,20 @@
>> EXPLAIN (COSTS OFF)
>> SELECT count(*) FROM tenk1
>> WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3) OR thousand = 41;
>> - QUERY PLAN
>> ---------------------------------------------------------------------------------------------------------
>> + QUERY PLAN
>> +---------------------------------------------------------------------------------------------------------------------------
>> Aggregate
>> -> Bitmap Heap Scan on tenk1
>> - Recheck Cond: (((thousand = 42) AND (tenthous = ANY ('{1,3}'::integer[]))) OR (thousand = 41))
>> + Recheck Cond: ((((thousand = 42) AND (tenthous = 1)) OR ((thousand = 42) AND (tenthous = 3))) OR (thousand = 41))
>> -> BitmapOr
>> - -> Bitmap Index Scan on tenk1_thous_tenthous
>> - Index Cond: ((thousand = 42) AND (tenthous = ANY ('{1,3}'::integer[])))
>> + -> BitmapOr
>> + -> Bitmap Index Scan on tenk1_thous_tenthous
>> + Index Cond: ((thousand = 42) AND (tenthous = 1))
>> + -> Bitmap Index Scan on tenk1_thous_tenthous
>> + Index Cond: ((thousand = 42) AND (tenthous = 3))
>> -> Bitmap Index Scan on tenk1_thous_tenthous
>> Index Cond: (thousand = 41)
>> -(8 rows)
>> +(11 rows)
> I think this query is not converted, because you only convert
> top-level ORs in the transform_ors() function. But in the example
> given, the target OR lays under AND, which in turn lays under another
> OR. I think you need to make transform_ors() recursive to handle
> cases like this.
Yes, you are right, it seems that a recursive method is needed here.
> I wonder about the default value of the parameter or_transform_limit
> of 500. In [1] and [2] you show the execution time degradation from 0
> to ~500 OR clauses. I made a simple SQL script with the query "SELECT
> * FROM pgbench_accounts a WHERE aid = 1 OR aid = 2 OR ... OR aid =
> 100;". The pgbench results for a single connection in prepared mode
> are the following.
> master: 936 tps
> patched (or_transform_limit == 0) :1414 tps
> So, transformation to ANY obviously accelerates the execution.
>
> I think it's important to identify the cases where this patch causes
> the degradation. Generally, I don't see why ANY could be executed
> slower than the equivalent OR clause. So, the possible degradation
> cases are slower plan generation and worse plans. I managed to find
> both.
>
> As you stated before, currently the OR transformation has a quadratic
> complexity depending on the number of or-clause-groups. I made a
> simple test to evaluate this. containing 10000 or-clause-groups.
> SELECT * FROM pgbench_accounts a WHERE aid + 1 * bid = 1 OR aid + 2 *
> bid = 1 OR ... OR aid + 10000 * bid = 1;
> master: 316ms
> patched: 7142ms
> Note, that the current or_transform_limit GUC parameter is not capable
> of cutting such cases, because it cuts cases lower than the limit not
> higher than the limit. In the comment, you mention that we could
> invent something like hash to handle this. Hash should be nice, but
> the problem is that we currently don't have a generic facility to hash
> nodes (or even order them). It would be nice to add this facility,
> that would be quite a piece of work. I would propose to limit this
> patch for now to handle just a single Var node as a non-const side of
> the clause and implement a simple hash for Vars.
I ran the query and saw that you were right, this place in the patch
turns out to be very expensive. In addition to the hash, I saw a second
solution to this problem - parameterize constants and store them in the
list, but this will not be such a universal solution as hashing. If the
variable, not the constant, changes, parameterization will not help.

I agree with your suggestion to try adding hashing. I'll take a closer
look at this.

> Another problem is the possible generation of worse plans. I made an
> example table with two partial indexes.
> create table test as (select (random()*10)::int x, (random()*1000) y
> from generate_series(1,1000000) i);
> create index test_x_1_y on test (y) where x = 1;
> create index test_x_2_y on test (y) where x = 2;
> vacuum analyze test;
>
> Without the transformation of ORs to ANY, our planner manages to use
> both indexes with a Bitmap scan.
> # explain select * from test where (x = 1 or x = 2) and y = 100;
> QUERY PLAN
> --------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on test (cost=8.60..12.62 rows=1 width=12)
> Recheck Cond: (((y = '100'::double precision) AND (x = 1)) OR ((y =
> '100'::double precision) AND (x = 2)))
> -> BitmapOr (cost=8.60..8.60 rows=1 width=0)
> -> Bitmap Index Scan on test_x_1_y (cost=0.00..4.30 rows=1 width=0)
> Index Cond: (y = '100'::double precision)
> -> Bitmap Index Scan on test_x_2_y (cost=0.00..4.30 rows=1 width=0)
> Index Cond: (y = '100'::double precision)
> (7 rows)
>
> With transformation, the planner can't use indexes.
> # explain select * from test where (x = 1 or x = 2) and y = 100;
> QUERY PLAN
> -----------------------------------------------------------------------------
> Gather (cost=1000.00..12690.10 rows=1 width=12)
> Workers Planned: 2
> -> Parallel Seq Scan on test (cost=0.00..11690.00 rows=1 width=12)
> Filter: ((x = ANY (ARRAY[1, 2])) AND (y = '100'::double precision))
> (4 rows)
>
> The solution I see would be to tech Bitmap scan to handle ANY clause
> in the same way as the OR clause. I think the entry point for the
> relevant logic is the choose_bitmap_and() function.
It's a good idea, I'll try.
But to be honest, I'm afraid that problems with selectivity may come up
again and in order to solve them, additional processing of RestrictInfo
may be required, which will be unnecessarily expensive. As far as I
understand, at this stage we are creating indexes for AND expressions
and there is a risk that its transformation may cause the need to change
references in all possible places where it was referenced.
> Regarding the GUC parameter, I don't see we need a limit. It's not
> yet clear whether a small number or a large number of OR clauses are
> more favorable for transformation. I propose to have just a boolean
> enable_or_transformation GUC.
>
> Links
> 1.https://www.postgresql.org/message-id/6b97b517-f36a-f0c6-3b3a-0cf8cfba220c%40yandex.ru
> 2.https://www.postgresql.org/message-id/938d82e1-98df-6553-334c-9db7c4e288ae%40yandex.ru

I tend to agree with you and I see that in some cases it really doesn't
help.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2023-10-15 23:47:03 Re: On login trigger: take three
Previous Message David E. Wheeler 2023-10-15 23:04:39 Re: Patch: Improve Boolean Predicate JSON Path Docs