Re: POC, WIP: OR-clause support for indexes

From: "a(dot)rybakina" <a(dot)rybakina(at)postgrespro(dot)ru>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, "Finnerty, Jim" <jfinnert(at)amazon(dot)com>, Marcos Pegoraro <marcos(at)f10(dot)com(dot)br>, Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, teodor(at)sigaev(dot)ru, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Peter Eisentraut <peter(at)eisentraut(dot)org>
Subject: Re: POC, WIP: OR-clause support for indexes
Date: 2023-10-25 11:04:24
Message-ID: 3c539d4b-1c3c-4119-a3c9-e335b81c83cf@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi!

On 15.10.2023 01:34, Alexander Korotkov wrote:
> Hi, Alena!
>
> Thank you for your work on the subject.
>
> On Wed, Oct 4, 2023 at 10:21 PM a.rybakina <a(dot)rybakina(at)postgrespro(dot)ru> wrote:
>> I fixed the kernel dump issue and all the regression tests were successful, but I discovered another problem when I added my own regression tests.
>> Some queries that contain "or" expressions do not convert to "ANY". I have described this in more detail using diff as expected and real results:
>>
>> diff -U3 /home/alena/postgrespro__copy6/src/test/regress/expected/create_index.out /home/alena/postgrespro__copy6/src/test/regress/results/create_index.out
>> --- /home/alena/postgrespro__copy6/src/test/regress/expected/create_index.out 2023-10-04 21:54:12.496282667 +0300
>> +++ /home/alena/postgrespro__copy6/src/test/regress/results/create_index.out 2023-10-04 21:55:41.665422459 +0300
>> @@ -1925,17 +1925,20 @@
>> EXPLAIN (COSTS OFF)
>> SELECT count(*) FROM tenk1
>> WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3) OR thousand = 41;
>> - QUERY PLAN
>> ---------------------------------------------------------------------------------------------------------
>> + QUERY PLAN
>> +---------------------------------------------------------------------------------------------------------------------------
>> Aggregate
>> -> Bitmap Heap Scan on tenk1
>> - Recheck Cond: (((thousand = 42) AND (tenthous = ANY ('{1,3}'::integer[]))) OR (thousand = 41))
>> + Recheck Cond: ((((thousand = 42) AND (tenthous = 1)) OR ((thousand = 42) AND (tenthous = 3))) OR (thousand = 41))
>> -> BitmapOr
>> - -> Bitmap Index Scan on tenk1_thous_tenthous
>> - Index Cond: ((thousand = 42) AND (tenthous = ANY ('{1,3}'::integer[])))
>> + -> BitmapOr
>> + -> Bitmap Index Scan on tenk1_thous_tenthous
>> + Index Cond: ((thousand = 42) AND (tenthous = 1))
>> + -> Bitmap Index Scan on tenk1_thous_tenthous
>> + Index Cond: ((thousand = 42) AND (tenthous = 3))
>> -> Bitmap Index Scan on tenk1_thous_tenthous
>> Index Cond: (thousand = 41)
>> -(8 rows)
>> +(11 rows)
> I think this query is not converted, because you only convert
> top-level ORs in the transform_ors() function. But in the example
> given, the target OR lays under AND, which in turn lays under another
> OR. I think you need to make transform_ors() recursive to handle
> cases like this.
>
> I wonder about the default value of the parameter or_transform_limit
> of 500. In [1] and [2] you show the execution time degradation from 0
> to ~500 OR clauses. I made a simple SQL script with the query "SELECT
> * FROM pgbench_accounts a WHERE aid = 1 OR aid = 2 OR ... OR aid =
> 100;". The pgbench results for a single connection in prepared mode
> are the following.
> master: 936 tps
> patched (or_transform_limit == 0) :1414 tps
> So, transformation to ANY obviously accelerates the execution.
>
> I think it's important to identify the cases where this patch causes
> the degradation. Generally, I don't see why ANY could be executed
> slower than the equivalent OR clause. So, the possible degradation
> cases are slower plan generation and worse plans. I managed to find
> both.
>
> As you stated before, currently the OR transformation has a quadratic
> complexity depending on the number of or-clause-groups. I made a
> simple test to evaluate this. containing 10000 or-clause-groups.
> SELECT * FROM pgbench_accounts a WHERE aid + 1 * bid = 1 OR aid + 2 *
> bid = 1 OR ... OR aid + 10000 * bid = 1;
> master: 316ms
> patched: 7142ms
> Note, that the current or_transform_limit GUC parameter is not capable
> of cutting such cases, because it cuts cases lower than the limit not
> higher than the limit. In the comment, you mention that we could
> invent something like hash to handle this. Hash should be nice, but
> the problem is that we currently don't have a generic facility to hash
> nodes (or even order them). It would be nice to add this facility,
> that would be quite a piece of work. I would propose to limit this
> patch for now to handle just a single Var node as a non-const side of
> the clause and implement a simple hash for Vars.
>
> Another problem is the possible generation of worse plans. I made an
> example table with two partial indexes.
> create table test as (select (random()*10)::int x, (random()*1000) y
> from generate_series(1,1000000) i);
> create index test_x_1_y on test (y) where x = 1;
> create index test_x_2_y on test (y) where x = 2;
> vacuum analyze test;
>
> Without the transformation of ORs to ANY, our planner manages to use
> both indexes with a Bitmap scan.
> # explain select * from test where (x = 1 or x = 2) and y = 100;
> QUERY PLAN
> --------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on test (cost=8.60..12.62 rows=1 width=12)
> Recheck Cond: (((y = '100'::double precision) AND (x = 1)) OR ((y =
> '100'::double precision) AND (x = 2)))
> -> BitmapOr (cost=8.60..8.60 rows=1 width=0)
> -> Bitmap Index Scan on test_x_1_y (cost=0.00..4.30 rows=1 width=0)
> Index Cond: (y = '100'::double precision)
> -> Bitmap Index Scan on test_x_2_y (cost=0.00..4.30 rows=1 width=0)
> Index Cond: (y = '100'::double precision)
> (7 rows)
>
> With transformation, the planner can't use indexes.
> # explain select * from test where (x = 1 or x = 2) and y = 100;
> QUERY PLAN
> -----------------------------------------------------------------------------
> Gather (cost=1000.00..12690.10 rows=1 width=12)
> Workers Planned: 2
> -> Parallel Seq Scan on test (cost=0.00..11690.00 rows=1 width=12)
> Filter: ((x = ANY (ARRAY[1, 2])) AND (y = '100'::double precision))
> (4 rows)
>
> The solution I see would be to tech Bitmap scan to handle ANY clause
> in the same way as the OR clause. I think the entry point for the
> relevant logic is the choose_bitmap_and() function.
>
> Regarding the GUC parameter, I don't see we need a limit. It's not
> yet clear whether a small number or a large number of OR clauses are
> more favorable for transformation. I propose to have just a boolean
> enable_or_transformation GUC.
>
I removed the limit from the hook, left the option to enable it or not.

I replaced the data structure so that the groups were formed not in a
list, but in a hash table. It seems to work fine, but I haven't figured
out yet why in some cases the regression test results are different and
the function doesn't work.

So far, I have formed a patch for the version where the conversion takes
place in parsing, since so far this patch looks the most reliable for me

For convenience, I have formed a patch for the very first version so far.

I have a suspicion that the problem is in the part where we form a hash
from a string. I'm still figuring it out.

Attachment Content-Type Size
v8.0-Replace-OR-clause-to-ANY-expressions.-Replace-X-N1-O.patch text/x-patch 33.9 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2023-10-25 11:13:57 Re: CDC/ETL system on top of logical replication with pgoutput, custom client
Previous Message José Neves 2023-10-25 10:53:47 RE: CDC/ETL system on top of logical replication with pgoutput, custom client