Re: POC, WIP: OR-clause support for indexes

From: "a(dot)rybakina" <a(dot)rybakina(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: "Finnerty, Jim" <jfinnert(at)amazon(dot)com>, Marcos Pegoraro <marcos(at)f10(dot)com(dot)br>, Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org, teodor(at)sigaev(dot)ru, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Subject: Re: POC, WIP: OR-clause support for indexes
Date: 2023-08-17 10:08:03
Lists: pgsql-hackers

Hi, all!
>> The optimizer will itself do a limited form of "normalizing to CNF".
>> Are you familiar with extract_restriction_or_clauses(), from
>> orclauses.c? Comments above the function have an example of how this
>> can work:
>>  * Although a join clause must reference multiple relations overall,
>>  * an OR of ANDs clause might contain sub-clauses that reference just one
>>  * relation and can be used to build a restriction clause for that rel.
>>  * For example consider
>>  *      WHERE ((a.x = 42 AND b.y = 43) OR (a.x = 44 AND b.z = 45));
>>  * We can transform this into
>>  *      WHERE ((a.x = 42 AND b.y = 43) OR (a.x = 44 AND b.z = 45))
>>  *          AND (a.x = 42 OR a.x = 44)
>>  *          AND (b.y = 43 OR b.z = 45);
>>  * which allows the latter clauses to be applied during the scans of a and b,
>>  * perhaps as index qualifications, and in any case reducing the number of
>>  * rows arriving at the join.  In essence this is a partial transformation to
>>  * CNF (AND of ORs format).  It is not complete, however, because we do not
>>  * unravel the original OR --- doing so would usually bloat the qualification
>>  * expression to little gain.
> This is an interesting feature. I hadn't noticed this function before;
> I had studied consider_new_or_clause(), which is called from there, many
> times. As far as I know, a selectivity calculation happens there, but as
> far as I remember I called it earlier, after my transformation, and
> unfortunately it didn't solve my problem with calculating selectivity.
> I'll look into it again; maybe I can find something I missed.
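To make the partial-CNF step quoted above concrete, here is a toy Python sketch (my own names, not PostgreSQL code): for each relation referenced in every arm of the OR, the per-relation conjuncts can be ORed together and attached as an extra restriction for that relation's scan.

```python
# Toy model of the partial-CNF extraction described in orclauses.c
# (illustrative only, not the actual planner code).  Each OR arm is a
# list of (relation, column, value) equality conjuncts.

def extract_restriction_ors(or_arms):
    """or_arms: list of arms, each a list of (rel, col, val) tuples.
    Returns, per relation referenced in *all* arms, the OR of that
    relation's conjuncts (one conjunct list per arm)."""
    rels = set.intersection(*(set(r for r, _, _ in arm) for arm in or_arms))
    extra = {}
    for rel in sorted(rels):
        extra[rel] = [[(r, c, v) for r, c, v in arm if r == rel]
                      for arm in or_arms]
    return extra

# WHERE ((a.x = 42 AND b.y = 43) OR (a.x = 44 AND b.z = 45))
arms = [[("a", "x", 42), ("b", "y", 43)],
        [("a", "x", 44), ("b", "z", 45)]]
print(extract_restriction_ors(arms))
```

This yields the extra restrictions (a.x = 42 OR a.x = 44) for a and (b.y = 43 OR b.z = 45) for b, matching the comment's example.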
>> Of course this immediately makes me wonder: shouldn't your patch be
>> able to perform an additional transformation here? You know, by
>> transforming "a.x = 42 OR a.x = 44" into "a IN (42, 44)"? Although I
>> haven't checked for myself, I assume that this doesn't happen right
>> now, since your patch currently performs all of its transformations
>> during parsing.
>> I also noticed that the same comment block goes on to say something
>> about "clauselist_selectivity's inability to recognize redundant
>> conditions". Perhaps that is relevant to the problems you were having
>> with selectivity estimation, back when the code was in
>> preprocess_qual_conditions() instead? I have no reason to believe that
>> there should be any redundancy left behind by your transformation, so
>> this is just one possibility to consider.
>> Separately, the commit message of commit 25a9e54d2d says something
>> about how the planner builds RestrictInfos, which seems
>> possibly-relevant. That commit enhanced extended statistics for OR
>> clauses, so the relevant paragraph describes a limitation of extended
>> statistics with OR clauses specifically. I'm just guessing, but it
>> still seems like it might be relevant to the problem you ran into with
>> selectivity estimation. Another possibility to consider.
> I understood what is said about AND clauses in this comment. It seems
> to me that AND clauses are stored as a (BoolExpr *) expr->args list of
> (RestrictInfo *) clauses (clauseA, clauseB), while OR clauses are stored
> as (BoolExpr *) expr->orclause -> (RestrictInfo *) clauseA ->
> (RestrictInfo *) clauseB.
> As I understand it, selectivity is calculated for each expression. But
> I'll explore it more deeply, because I think this place may contain the
> answer to what's wrong with the selectivity calculation in my patch.

I was able to move the transformation there (extract_restriction_or_clauses)
and didn't have any problem with the selectivity calculation; in addition,
it also works at the stage where redundant and duplicate clauses are
handled. So, it looks like:

CREATE TABLE tenk1 (unique1 int, unique2 int, ten int, hundred int);
INSERT INTO tenk1 SELECT x,x,x,x FROM generate_series(1,50000) AS x;
CREATE INDEX a_idx1 ON tenk1(unique1);
CREATE INDEX a_idx2 ON tenk1(unique2);
CREATE INDEX a_hundred ON tenk1(hundred);

explain analyze select * from tenk1 a join tenk1 b on ((a.unique2 = 3 or
a.unique2 = 7));

 Nested Loop  (cost=0.29..2033.62 rows=100000 width=32) (actual time=0.090..60.258 rows=100000 loops=1)
   ->  Seq Scan on tenk1 b  (cost=0.00..771.00 rows=50000 width=16) (actual time=0.016..9.747 rows=50000 loops=1)
   ->  Materialize  (cost=0.29..12.62 rows=2 width=16) (actual time=0.000..0.000 rows=2 loops=50000)
         ->  Index Scan using a_idx2 on tenk1 a  (cost=0.29..12.62 rows=2 width=16) (actual time=0.063..0.068 rows=2 loops=1)
               Index Cond: (unique2 = ANY (ARRAY[3, 7]))
 Planning Time: 8.257 ms
 Execution Time: 64.453 ms
(7 rows)

Overall, the earlier failure was due to incorrectly determined types of the
elements in the array; if we apply the transformation while properly
determining the operator, we can avoid such problems (I used
make_scalar_array_op and have not yet found an alternative to it).
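As a toy illustration (not the patch itself) of why consistent element types matter when folding "x = c1 OR x = c2" into "x = ANY(ARRAY[...])": constants can only be grouped when they target the same column with the same operator and share one element type, which mirrors what the operator lookup via make_scalar_array_op has to verify.

```python
# Illustrative sketch, not planner code: group OR'ed equality clauses
# into an "= ANY" clause only when column, operator, and element type
# all match; mixed-type constants are deliberately left un-folded.
from collections import OrderedDict

def fold_or_to_any(clauses):
    """clauses: list of (column, op, const) from the arms of an OR."""
    groups = OrderedDict()
    for col, op, const in clauses:
        # the constant's Python type stands in for the resolved array
        # element type in this sketch
        groups.setdefault((col, op, type(const)), []).append(const)
    folded = []
    for (col, op, _), consts in groups.items():
        if op == "=" and len(consts) > 1:
            folded.append((col, "= ANY", consts))
        else:
            folded.extend((col, op, c) for c in consts)
    return folded

print(fold_or_to_any([("unique2", "=", 3), ("unique2", "=", 7)]))
```

With matching int constants this folds to a single ("unique2", "= ANY", [3, 7]) clause; if one constant were the string "7" instead, the two clauses would stay separate.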

When I moved the transformation to the index creation stage, it couldn't
work properly, and as a result I faced the same problem with selectivity
calculation. I suppose the selectivity values computed earlier are reused
there rather than recalculated from scratch. Perhaps we could solve this by
forcibly recalculating the selectivity values, but I foresee other
problems with that.

explain analyze select * from tenk1 a join tenk1 b on ((a.unique2 = 3 or
a.unique2 = 7));

 Nested Loop  (cost=12.58..312942.91 rows=24950000 width=32) (actual time=0.040..47.582 rows=100000 loops=1)
   ->  Seq Scan on tenk1 b  (cost=0.00..771.00 rows=50000 width=16) (actual time=0.009..7.039 rows=50000 loops=1)
   ->  Materialize  (cost=12.58..298.16 rows=499 width=16) (actual time=0.000..0.000 rows=2 loops=50000)
         ->  Bitmap Heap Scan on tenk1 a  (cost=12.58..295.66 rows=499 width=16) (actual time=0.025..0.028 rows=2 loops=1)
               Recheck Cond: ((unique2 = 3) OR (unique2 = 7))
               Heap Blocks: exact=1
               ->  BitmapOr  (cost=12.58..12.58 rows=500 width=0) (actual time=0.023..0.024 rows=0 loops=1)
                     ->  Bitmap Index Scan on a_idx2  (cost=0.00..6.17 rows=250 width=0) (actual time=0.019..0.019 rows=1 loops=1)
                           Index Cond: (unique2 = 3)
                     ->  Bitmap Index Scan on a_idx2  (cost=0.00..6.17 rows=250 width=0) (actual time=0.003..0.003 rows=1 loops=1)
                           Index Cond: (unique2 = 7)
 Planning Time: 0.401 ms
 Execution Time: 51.350 ms
(13 rows)

I have attached a diff file so far, but it is very raw: it does not pass
all the regression tests (I attached regression.diff), and it even has bad
conversion cases (some cases did not work at all; in others, some nodes
were left unconverted). But I did see one transformation that was the most
interesting to me:

(tenthous = 1 OR tenthous = 3 OR tenthous = 42);
-                          QUERY PLAN
- Bitmap Heap Scan on tenk1
-   Recheck Cond: (((thousand = 42) AND (tenthous = 1)) OR ((thousand = 42) AND (tenthous = 3)) OR ((thousand = 42) AND (tenthous = 42)))
-   ->  BitmapOr
-         ->  Bitmap Index Scan on tenk1_thous_tenthous
-               Index Cond: ((thousand = 42) AND (tenthous = 1))
-         ->  Bitmap Index Scan on tenk1_thous_tenthous
-               Index Cond: ((thousand = 42) AND (tenthous = 3))
-         ->  Bitmap Index Scan on tenk1_thous_tenthous
-               Index Cond: ((thousand = 42) AND (tenthous = 42))
-(9 rows)
+                          QUERY PLAN
+ Index Scan using tenk1_thous_tenthous on tenk1
+   Index Cond: ((thousand = 42) AND (tenthous = ANY (ARRAY[1, 3, 42])))
+(2 rows)

Attachment Content-Type Size
regresssion.diff text/x-patch 26.4 KB
diff_fix_sel.diff text/x-patch 10.6 KB
