Re: POC, WIP: OR-clause support for indexes

From: "a(dot)rybakina" <a(dot)rybakina(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: "Finnerty, Jim" <jfinnert(at)amazon(dot)com>, Marcos Pegoraro <marcos(at)f10(dot)com(dot)br>, Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org, teodor(at)sigaev(dot)ru, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Subject: Re: POC, WIP: OR-clause support for indexes
Date: 2023-08-17 10:20:33
Message-ID: 11403645-b342-c400-859e-47d0f41ec22a@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Sorry, I didn't write correctly enough, about the second second place in
the code where the conversion works well enough is the removal of
duplicate OR expressions.

I attached patch to learn it in more detail.

On 17.08.2023 13:08, a.rybakina wrote:
> Hi, all!
>>> The optimizer will itself do a limited form of "normalizing to CNF".
>>> Are you familiar with extract_restriction_or_clauses(), from
>>> orclauses.c? Comments above the function have an example of how this
>>> can work:
>>>
>>>   * Although a join clause must reference multiple relations overall,
>>>   * an OR of ANDs clause might contain sub-clauses that reference
>>> just one
>>>   * relation and can be used to build a restriction clause for that
>>> rel.
>>>   * For example consider
>>>   *      WHERE ((a.x = 42 AND b.y = 43) OR (a.x = 44 AND b.z = 45));
>>>   * We can transform this into
>>>   *      WHERE ((a.x = 42 AND b.y = 43) OR (a.x = 44 AND b.z = 45))
>>>   *          AND (a.x = 42 OR a.x = 44)
>>>   *          AND (b.y = 43 OR b.z = 45);
>>>   * which allows the latter clauses to be applied during the scans
>>> of a and b,
>>>   * perhaps as index qualifications, and in any case reducing the
>>> number of
>>>   * rows arriving at the join.  In essence this is a partial
>>> transformation to
>>>   * CNF (AND of ORs format).  It is not complete, however, because
>>> we do not
>>>   * unravel the original OR --- doing so would usually bloat the
>>> qualification
>>>   * expression to little gain.
>> This is an interesting feature. I didn't notice this function before,
>> I studied many times consider_new_or_cause, which were called there.
>> As far as I know, there is a selectivity calculation going on there,
>> but as far as I remember, I called it earlier after my conversion,
>> and unfortunately it didn't solve my problem with calculating
>> selectivity. I'll reconsider it again, maybe I can find something I
>> missed.
>>> Of course this immediately makes me wonder: shouldn't your patch be
>>> able to perform an additional transformation here? You know, by
>>> transforming "a.x = 42 OR a.x = 44" into "a IN (42, 44)"? Although I
>>> haven't checked for myself, I assume that this doesn't happen right
>>> now, since your patch currently performs all of its transformations
>>> during parsing.
>>>
>>> I also noticed that the same comment block goes on to say something
>>> about "clauselist_selectivity's inability to recognize redundant
>>> conditions". Perhaps that is relevant to the problems you were having
>>> with selectivity estimation, back when the code was in
>>> preprocess_qual_conditions() instead? I have no reason to believe that
>>> there should be any redundancy left behind by your transformation, so
>>> this is just one possibility to consider.
>>> Separately, the commit message of commit 25a9e54d2d says something
>>> about how the planner builds RestrictInfos, which seems
>>> possibly-relevant. That commit enhanced extended statistics for OR
>>> clauses, so the relevant paragraph describes a limitation of extended
>>> statistics with OR clauses specifically. I'm just guessing, but it
>>> still seems like it might be relevant to the problem you ran into with
>>> selectivity estimation. Another possibility to consider.
>>
>> I understood what is said about AND clauses in this comment. It seems
>> to me that AND clauses saved like (BoolExpr *)
>> expr->args->(RestrictInfo *) clauseA->(RestrictInfo *)clauseB lists
>> and OR clauses saved like (BoolExpr *) expr ->
>> orclause->(RestrictInfo *)clause A->(RestrictInfo *)clause B.
>>
>> As I understand it, selectivity is calculated for each expression.
>> But I'll exploring it deeper, because I think this place may contain
>> the answer to the question, what's wrong with selectivity calculation
>> in my patch.
>
> I could move transformation in there (extract_restriction_or_clauses)
> and didn't have any problem with selectivity calculation, besides it
> also works on the redundant or duplicates stage. So, it looks like:
>
> CREATE TABLE tenk1 (unique1 int, unique2 int, ten int, hundred int);
> insert into tenk1 SELECT x,x,x,x FROM generate_series(1,50000) as x;
> CREATE INDEX a_idx1 ON tenk1(unique1); CREATE INDEX a_idx2 ON
> tenk1(unique2); CREATE INDEX a_hundred ON tenk1(hundred);
>
> explain analyze select * from tenk1 a join tenk1 b on ((a.unique2 = 3
> or a.unique2 = 7));
>
> PLAN
> ------------------------------------------------------------------------------------------------------------------------------
> Nested Loop (cost=0.29..2033.62 rows=100000 width=32) (actual
> time=0.090..60.258 rows=100000 loops=1) -> Seq Scan on tenk1 b
> (cost=0.00..771.00 rows=50000 width=16) (actual time=0.016..9.747
> rows=50000 loops=1) -> Materialize (cost=0.29..12.62 rows=2 width=16)
> (actual time=0.000..0.000 rows=2 loops=50000) -> Index Scan using
> a_idx2 on tenk1 a (cost=0.29..12.62 rows=2 width=16) (actual
> time=0.063..0.068 rows=2 loops=1) Index Cond: (unique2 = ANY (ARRAY[3,
> 7])) Planning Time: 8.257 ms Execution Time: 64.453 ms (7 rows)
>
> Overall, this was due to incorrectly defined types of elements in the
> array, and if we had applied the transformation with the definition of
> the tup operator, we could have avoided such problems (I used
> make_scalar_array_op and have not yet found an alternative to this).
>
> When I moved the transformation on the index creation stage, it
> couldn't work properly and as a result I faced the same problem of
> selectivity calculation. I supposed that the selectivity values are
> also used there, and not recalculated all over again. perhaps we can
> solve this by forcibly recalculating the selectivity values, but I
> foresee other problems there.
>
> explain analyze select * from tenk1 a join tenk1 b on ((a.unique2 = 3
> or a.unique2 = 7));
>
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------
> Nested Loop (cost=12.58..312942.91 rows=24950000 width=32) (actual
> time=0.040..47.582 rows=100000 loops=1) -> Seq Scan on tenk1 b
> (cost=0.00..771.00 rows=50000 width=16) (actual time=0.009..7.039
> rows=50000 loops=1) -> Materialize (cost=12.58..298.16 rows=499
> width=16) (actual time=0.000..0.000 rows=2 loops=50000) -> Bitmap Heap
> Scan on tenk1 a (cost=12.58..295.66 rows=499 width=16) (actual
> time=0.025..0.028 rows=2 loops=1) Recheck Cond: ((unique2 = 3) OR
> (unique2 = 7)) Heap Blocks: exact=1 -> BitmapOr (cost=12.58..12.58
> rows=500 width=0) (actual time=0.023..0.024 rows=0 loops=1) -> Bitmap
> Index Scan on a_idx2 (cost=0.00..6.17 rows=250 width=0) (actual
> time=0.019..0.019 rows=1 loops=1) Index Cond: (unique2 = 3) -> Bitmap
> Index Scan on a_idx2 (cost=0.00..6.17 rows=250 width=0) (actual
> time=0.003..0.003 rows=1 loops=1) Index Cond: (unique2 = 7) Planning
> Time: 0.401 ms Execution Time: 51.350 ms (13 rows)
>
> I have attached a diff file so far, but it is very raw and did not
> pass all regression tests (I attached regression.diff) and even had
> bad conversion cases (some of the cases did not work at all, in other
> cases there were no non-converted nodes). But now I see an interesting
> transformation, which was the most interesting for me.
>
> EXPLAIN (COSTS OFF) SELECT * FROM tenk1 WHERE thousand = 42 AND
> (tenthous = 1 OR tenthous = 3 OR tenthous = 42); - QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------
> - Bitmap Heap Scan on tenk1 - Recheck Cond: (((thousand = 42) AND
> (tenthous = 1)) OR ((thousand = 42) AND (tenthous = 3)) OR ((thousand
> = 42) AND (tenthous = 42))) - -> BitmapOr - -> Bitmap Index Scan on
> tenk1_thous_tenthous - Index Cond: ((thousand = 42) AND (tenthous =
> 1)) - -> Bitmap Index Scan on tenk1_thous_tenthous - Index Cond:
> ((thousand = 42) AND (tenthous = 3)) - -> Bitmap Index Scan on
> tenk1_thous_tenthous - Index Cond: ((thousand = 42) AND (tenthous =
> 42)) -(9 rows) + QUERY PLAN
> +------------------------------------------------------------------------
> + Index Scan using tenk1_thous_tenthous on tenk1 + Index Cond:
> ((thousand = 42) AND (tenthous = ANY (ARRAY[1, 3, 42]))) +(2 rows)
>

Attachment Content-Type Size
diff_fix_sel1.diff text/x-patch 8.7 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message shveta malik 2023-08-17 10:39:54 Re: Synchronizing slots from primary to standby
Previous Message Hayato Kuroda (Fujitsu) 2023-08-17 10:18:42 RE: [PoC] pg_upgrade: allow to upgrade publisher node