Re: POC, WIP: OR-clause support for indexes

From: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
To: Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org, teodor(at)sigaev(dot)ru
Subject: Re: POC, WIP: OR-clause support for indexes
Date: 2023-01-14 15:13:54
Message-ID: 919bfbcb-f812-758d-d687-71f89f0d9a68@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I agree with your idea and try to implement it and will soon attach a
patch with a solution.

I also have a really practical example confirming that such optimization
can be useful.

A query was written that consisted of 50000 conditions due to the fact
that the ORM framework couldn't work with a query having an ANY
operator. In summary, we got a better plan that contained 50000 Bitmap
Index Scan nodes with 50000 different conditions. Since approximately
27336 Bite of memory were required to initialize one BitmapOr Index Scan
node, therefore, about 1.27 GB of memory was spent at the initialization
step of the plan execution and query execution time was about 55756,053
ms (00:55,756).

|psql -U postgres -c "CREATE DATABASE test_db" pgbench -U postgres -d
test_db -i -s 10 ||SELECT FORMAT('prepare x %s AS SELECT * FROM pgbench_accounts a WHERE
%s', '(' || string_agg('int', ',') || ')', string_agg(FORMAT('aid =
$%s', g.id), ' or ') ) AS cmd FROM generate_series(1, 50000) AS g(id)
\gexec ||SELECT FORMAT('execute x %s;', '(' || string_agg(g.id::text, ',') ||
')') AS cmd FROM generate_series(1, 50000) AS g(id) \gexec |||||

||

I got the plan of this query:

|QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on pgbench_accounts a  (cost=44.35..83.96 rows=10
width=97)
   Recheck Cond: ((aid = 1) OR (aid = 2) OR (aid = 3) OR (aid = 4) OR
(aid = 5) OR (aid = 6) OR (aid = 7) OR (aid = 8) OR (aid = 9) OR (aid = 10))
   ->  BitmapOr  (cost=44.35..44.35 rows=10 width=0)
         ->  Bitmap Index Scan on pgbench_accounts_pkey
(cost=0.00..4.43 rows=1 width=0)
               Index Cond: (aid = 1)
         ->  Bitmap Index Scan on pgbench_accounts_pkey
(cost=0.00..4.43 rows=1 width=0)
               Index Cond: (aid = 2)
         ->  Bitmap Index Scan on pgbench_accounts_pkey
(cost=0.00..4.43 rows=1 width=0)
               Index Cond: (aid = 3)
         ->  Bitmap Index Scan on pgbench_accounts_pkey
(cost=0.00..4.43 rows=1 width=0)
               Index Cond: (aid = 4)
         ->  Bitmap Index Scan on pgbench_accounts_pkey
(cost=0.00..4.43 rows=1 width=0)
               Index Cond: (aid = 5)
         ->  Bitmap Index Scan on pgbench_accounts_pkey
(cost=0.00..4.43 rows=1 width=0)
               Index Cond: (aid = 6)
         ->  Bitmap Index Scan on pgbench_accounts_pkey
(cost=0.00..4.43 rows=1 width=0)
               Index Cond: (aid = 7)
         ->  Bitmap Index Scan on pgbench_accounts_pkey
(cost=0.00..4.43 rows=1 width=0)
               Index Cond: (aid = 8)
         ->  Bitmap Index Scan on pgbench_accounts_pkey
(cost=0.00..4.43 rows=1 width=0)
               Index Cond: (aid = 9)
         ->  Bitmap Index Scan on pgbench_accounts_pkey
(cost=0.00..4.43 rows=1 width=0)
               Index Cond: (aid = 10)|

If I rewrite this query using ANY operator,

SELECT FORMAT('prepare x %s AS SELECT * FROM pgbench_accounts a WHERE aid = ANY(SELECT
g.id FROM generate_series(1, 50000) AS g(id))',
'(' || string_agg('int',',') ||')'
) AS cmd FROM generate_series(1, 50000) AS g(id)
\gexec

I will get a plan where the array comparison operator is used through
ANY operator at the index scan stage. It's execution time is
significantly lower as  339,764 ms.

QUERY PLAN
---------------------------------------------------------------------------------------------------
Index Scan using pgbench_accounts_pkey on pgbench_accounts a (cost=0.42..48.43 rows=10 width=97)
Index Cond: (aid = ANY ('{1,2,3,4,5,6,7,8,9,10}'::integer[]))
(2 rows)

IN operator is also converted to ANY operator, and if I rewrite this
query as:

SELECT FORMAT('prepare x %s AS SELECT * FROM pgbench_accounts a WHERE aid IN(%s)',
'(' || string_agg('int',',') ||')',
string_agg(FORMAT('%s', g.id),', ')
) AS cmd
FROM generate_series(1, 50000) AS g(id)
\gexec

I will get the same plan as the previous one using ANY operator and his
execution time will be about the same.

QUERY PLAN
---------------------------------------------------------------------------------------------------
Index Scan using pgbench_accounts_pkey on pgbench_accounts a (cost=0.42..48.43 rows=10 width=97)
Index Cond: (aid = ANY ('{1,2,3,4,5,6,7,8,9,10}'::integer[]))
(2 rows)

On 28.12.2022 07:19, Andrey Lepikhov wrote:
> On 12/26/15 23:04, Teodor Sigaev wrote:
>> I'd like to present OR-clause support for indexes. Although
>> OR-clauses could be supported by bitmapOR index scan it isn't very
>> effective and such scan lost any order existing in index. We (with
>> Alexander Korotkov) presented results on Vienna's conference this
>> year. In short, it provides performance improvement:
>>
>> EXPLAIN ANALYZE
>> SELECT count(*) FROM tst WHERE id = 5 OR id = 500 OR id = 5000;
>> ...
>> The problems on the way which I see for now:
>> 1 Calculating cost. Right now it's just a simple transformation of
>> costs computed for BitmapOr path. I'd like to hope that's possible
>> and so index's estimation function could be non-touched. So, they
>> could believe that all clauses are implicitly-ANDed
>> 2 I'd like to add such support to btree but it seems that it should
>> be a separated patch. Btree search algorithm doesn't use any kind of
>> stack of pages and algorithm to walk over btree doesn't clear for me
>> for now.
>> 3 I could miss some places which still assumes  implicitly-ANDed list
>> of clauses although regression tests passes fine.
> I support such a cunning approach. But this specific case, you
> demonstrated above, could be optimized independently at an earlier
> stage. If to convert:
>
> (F(A) = ConstStableExpr_1) OR (F(A) = ConstStableExpr_2)
> to
> F(A) IN (ConstStableExpr_1, ConstStableExpr_2)
>
> it can be seen significant execution speedup. For example, using the
> demo.sql to estimate maximum positive effect we see about 40% of
> execution and 100% of planning speedup.
>
> To avoid unnecessary overhead, induced by the optimization, such
> transformation may be made at the stage of planning (we have
> cardinality estimations and have pruned partitions) but before
> creation of a relation scan paths. So, we can avoid planning overhead
> and non-optimal BitmapOr in the case of many OR's possibly aggravated
> by many indexes on the relation.
> For example, such operation can be executed in create_index_paths()
> before passing rel->indexlist.
>
--
Alena Rybakina
Postgres Professional

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Marcos Pegoraro 2023-01-14 15:45:27 Re: POC, WIP: OR-clause support for indexes
Previous Message Andrew Dunstan 2023-01-14 14:49:12 Re: Extracting cross-version-upgrade knowledge from buildfarm client