Re: POC, WIP: OR-clause support for indexes

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: "a(dot)rybakina" <a(dot)rybakina(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org, "Finnerty, Jim" <jfinnert(at)amazon(dot)com>, Marcos Pegoraro <marcos(at)f10(dot)com(dot)br>, Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, teodor(at)sigaev(dot)ru, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Peter Eisentraut <peter(at)eisentraut(dot)org>
Subject: Re: POC, WIP: OR-clause support for indexes
Date: 2023-10-14 22:34:51
Message-ID: CAPpHfduJtO0s9E=SHUTzrCD88BH0eik0UNog1_q3XBF2wLmH6g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, Alena!

Thank you for your work on the subject.

On Wed, Oct 4, 2023 at 10:21 PM a.rybakina <a(dot)rybakina(at)postgrespro(dot)ru> wrote:
> I fixed the kernel dump issue and all the regression tests were successful, but I discovered another problem when I added my own regression tests.
> Some queries that contain "or" expressions do not convert to "ANY". I have described this in more detail using diff as expected and real results:
>
> diff -U3 /home/alena/postgrespro__copy6/src/test/regress/expected/create_index.out /home/alena/postgrespro__copy6/src/test/regress/results/create_index.out
> --- /home/alena/postgrespro__copy6/src/test/regress/expected/create_index.out 2023-10-04 21:54:12.496282667 +0300
> +++ /home/alena/postgrespro__copy6/src/test/regress/results/create_index.out 2023-10-04 21:55:41.665422459 +0300
> @@ -1925,17 +1925,20 @@
> EXPLAIN (COSTS OFF)
> SELECT count(*) FROM tenk1
> WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3) OR thousand = 41;
> - QUERY PLAN
> ---------------------------------------------------------------------------------------------------------
> + QUERY PLAN
> +---------------------------------------------------------------------------------------------------------------------------
> Aggregate
> -> Bitmap Heap Scan on tenk1
> - Recheck Cond: (((thousand = 42) AND (tenthous = ANY ('{1,3}'::integer[]))) OR (thousand = 41))
> + Recheck Cond: ((((thousand = 42) AND (tenthous = 1)) OR ((thousand = 42) AND (tenthous = 3))) OR (thousand = 41))
> -> BitmapOr
> - -> Bitmap Index Scan on tenk1_thous_tenthous
> - Index Cond: ((thousand = 42) AND (tenthous = ANY ('{1,3}'::integer[])))
> + -> BitmapOr
> + -> Bitmap Index Scan on tenk1_thous_tenthous
> + Index Cond: ((thousand = 42) AND (tenthous = 1))
> + -> Bitmap Index Scan on tenk1_thous_tenthous
> + Index Cond: ((thousand = 42) AND (tenthous = 3))
> -> Bitmap Index Scan on tenk1_thous_tenthous
> Index Cond: (thousand = 41)
> -(8 rows)
> +(11 rows)

I think this query is not converted, because you only convert
top-level ORs in the transform_ors() function. But in the example
given, the target OR lays under AND, which in turn lays under another
OR. I think you need to make transform_ors() recursive to handle
cases like this.

I wonder about the default value of the parameter or_transform_limit
of 500. In [1] and [2] you show the execution time degradation from 0
to ~500 OR clauses. I made a simple SQL script with the query "SELECT
* FROM pgbench_accounts a WHERE aid = 1 OR aid = 2 OR ... OR aid =
100;". The pgbench results for a single connection in prepared mode
are the following.
master: 936 tps
patched (or_transform_limit == 0) :1414 tps
So, transformation to ANY obviously accelerates the execution.

I think it's important to identify the cases where this patch causes
the degradation. Generally, I don't see why ANY could be executed
slower than the equivalent OR clause. So, the possible degradation
cases are slower plan generation and worse plans. I managed to find
both.

As you stated before, currently the OR transformation has a quadratic
complexity depending on the number of or-clause-groups. I made a
simple test to evaluate this. containing 10000 or-clause-groups.
SELECT * FROM pgbench_accounts a WHERE aid + 1 * bid = 1 OR aid + 2 *
bid = 1 OR ... OR aid + 10000 * bid = 1;
master: 316ms
patched: 7142ms
Note, that the current or_transform_limit GUC parameter is not capable
of cutting such cases, because it cuts cases lower than the limit not
higher than the limit. In the comment, you mention that we could
invent something like hash to handle this. Hash should be nice, but
the problem is that we currently don't have a generic facility to hash
nodes (or even order them). It would be nice to add this facility,
that would be quite a piece of work. I would propose to limit this
patch for now to handle just a single Var node as a non-const side of
the clause and implement a simple hash for Vars.

Another problem is the possible generation of worse plans. I made an
example table with two partial indexes.
create table test as (select (random()*10)::int x, (random()*1000) y
from generate_series(1,1000000) i);
create index test_x_1_y on test (y) where x = 1;
create index test_x_2_y on test (y) where x = 2;
vacuum analyze test;

Without the transformation of ORs to ANY, our planner manages to use
both indexes with a Bitmap scan.
# explain select * from test where (x = 1 or x = 2) and y = 100;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=8.60..12.62 rows=1 width=12)
Recheck Cond: (((y = '100'::double precision) AND (x = 1)) OR ((y =
'100'::double precision) AND (x = 2)))
-> BitmapOr (cost=8.60..8.60 rows=1 width=0)
-> Bitmap Index Scan on test_x_1_y (cost=0.00..4.30 rows=1 width=0)
Index Cond: (y = '100'::double precision)
-> Bitmap Index Scan on test_x_2_y (cost=0.00..4.30 rows=1 width=0)
Index Cond: (y = '100'::double precision)
(7 rows)

With transformation, the planner can't use indexes.
# explain select * from test where (x = 1 or x = 2) and y = 100;
QUERY PLAN
-----------------------------------------------------------------------------
Gather (cost=1000.00..12690.10 rows=1 width=12)
Workers Planned: 2
-> Parallel Seq Scan on test (cost=0.00..11690.00 rows=1 width=12)
Filter: ((x = ANY (ARRAY[1, 2])) AND (y = '100'::double precision))
(4 rows)

The solution I see would be to tech Bitmap scan to handle ANY clause
in the same way as the OR clause. I think the entry point for the
relevant logic is the choose_bitmap_and() function.

Regarding the GUC parameter, I don't see we need a limit. It's not
yet clear whether a small number or a large number of OR clauses are
more favorable for transformation. I propose to have just a boolean
enable_or_transformation GUC.

Links
1. https://www.postgresql.org/message-id/6b97b517-f36a-f0c6-3b3a-0cf8cfba220c%40yandex.ru
2. https://www.postgresql.org/message-id/938d82e1-98df-6553-334c-9db7c4e288ae%40yandex.ru

------
Regards,
Alexander Korotkov

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Erik Wienhold 2023-10-14 23:51:05 Re: Patch: Improve Boolean Predicate JSON Path Docs
Previous Message David E. Wheeler 2023-10-14 20:45:35 Re: Patch: Improve Boolean Predicate JSON Path Docs