Re: POC, WIP: OR-clause support for indexes

From: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Finnerty, Jim" <jfinnert(at)amazon(dot)com>, Marcos Pegoraro <marcos(at)f10(dot)com(dot)br>, Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, teodor(at)sigaev(dot)ru, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Peter Eisentraut <peter(at)eisentraut(dot)org>
Subject: Re: POC, WIP: OR-clause support for indexes
Date: 2023-10-26 20:41:54
Message-ID: 6b1a967f-5493-470e-96d8-92bbbcad630a@postgrespro.ru
Lists: pgsql-hackers

On 26.10.2023 22:58, Robert Haas wrote:
> On Thu, Oct 26, 2023 at 3:47 PM Alena Rybakina
> <a(dot)rybakina(at)postgrespro(dot)ru> wrote:
>> With a small number of OR elements, the cost of the OR expression is
>> lower than with ANY; with a large number it is, on the contrary, higher.
> Alexander's example seems to show that it's not that simple. If I'm
> reading his example correctly, with things like aid = 1, the
> transformation usually wins even if the number of things in the OR
> expression is large, but with things like aid + 1 * bid = 1, the
> transformation seems to lose at least with larger numbers of items. So
> it's not JUST the number of OR elements but also what they contain,
> unless I'm misunderstanding his point.
Yes, I agree: with Alexander's example this option will not help, and
here I need to look inside the Expr itself. But I noticed that such a
complex non-constant expression is always an OpExpr node, whereas if
the non-constant part contains only a single variable, it is a Var
node. We could add a constraint so that we transform only expressions
over simple variables, like x=1 OR x=2 OR x=3, etc., and ignore
expressions like x*1+y=1 OR x*2+y=2.
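
Something like the following check could enforce that constraint (just
a sketch of the idea, not actual patch code; the helper name is mine):

#include "postgres.h"
#include "nodes/primnodes.h"
#include "nodes/pg_list.h"

static bool
is_simple_or_arm(Node *expr)
{
    OpExpr *op;

    if (!IsA(expr, OpExpr))
        return false;
    op = (OpExpr *) expr;
    if (list_length(op->args) != 2)
        return false;

    /* accept only "Var <op> Const"; skip arithmetic like x*1 + y = 1 */
    return IsA(linitial(op->args), Var) &&
           IsA(lsecond(op->args), Const);
}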

But then we would miss expressions whose non-constant part is the same
in every branch. For example, we could transform x*1+y=1 OR x*1+y=2 ...
into x*1+y = ANY (ARRAY[1, 2, ...]). But I think that is not so
critical, because such cases are rare.
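
If we ever wanted to cover that case too, grouping the OR arms by
their non-constant side would be enough; equal() already gives us a
deep comparison of expression trees (again just a sketch, the helper
name is mine):

#include "postgres.h"
#include "nodes/primnodes.h"
#include "nodes/pg_list.h"

/*
 * True when two OR arms apply the same operator to the same
 * left-hand expression, e.g. x*1 + y = 1 and x*1 + y = 2.
 */
static bool
same_nonconst_side(OpExpr *a, OpExpr *b)
{
    return a->opno == b->opno &&
           equal(linitial(a->args), linitial(b->args));
}
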
>
>> Index Scan using pg_class_oid_index on pg_class (cost=0.27..2859.42 rows=414 width=68) (actual time=1.504..34.183 rows=260 loops=1)
>> Index Cond: (oid = ANY (ARRAY['1'::oid, '2'::oid, '3'::oid, '4'::oid, '5'::oid, '6'::oid, '7'::oid,
>>
>> Bitmap Heap Scan on pg_class (cost=43835.00..54202.14 rows=414 width=68) (actual time=39.958..41.293 rows=260 loops=1)
>> Recheck Cond: ((oid = '1'::oid) OR (oid = '2'::oid) OR (oid = '3'::oid) OR (oid = '4'::oid) OR (oid =
>>
>> I think we could compare which cost is lower, and if it is lower with
>> the expressions converted to ANY, keep working with that form;
>> otherwise keep the original OR expressions. But we still need to
>> perform the conversion to find out its cost.
> To me, this sort of output suggests that perhaps the transformation is
> being done in the wrong place. I expect that we have to decide whether
> to convert from OR to = ANY(...) at a very early stage of the planner,
> before we have any idea what the selected path will ultimately be. But
> this output suggests that we want the answer to depend on which kind
> of path is going to be faster, which would seem to argue for doing
> this sort of transformation as part of path generation for only those
> paths that will benefit from it, rather than during earlier phases of
> expression processing/simplification.
>
> I'm not sure I have the full picture here, though, so I might have
> this all wrong.
>
This would be the ideal option, and to be honest, I also like the
conversion at an early stage because there are no problems with
selectivity estimation or with updating links if we change the
structure of the relation's RestrictInfo.

But in terms of calculating which form is better to use, transformed
or original, I think this solution might be complicated: we would need
to identify not only the cases in which the transformation wins in
principle, but also the kinds of data it works best with. There is a
risk of missing some cases, and we might end up needing our own
evaluation model. Right now it is hard for me to come up with
something simple.

The cost-based option seems simpler and clearer to me, but yes, it is
difficult to decide at which point to do the conversion so as to get
the most accurate estimate.
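
For the record, the comparison I have in mind could look roughly like
this, even though the real decision would have to look at full path
costs rather than just qual evaluation cost (a sketch only; where in
the planner to call it is exactly the open question):

#include "postgres.h"
#include "optimizer/cost.h"

/*
 * Sketch: prefer whichever representation of the clause list is
 * cheaper to evaluate.  Returns true if the ANY form wins.
 */
static bool
any_form_is_cheaper(PlannerInfo *root, List *or_quals, List *any_quals)
{
    QualCost    or_cost;
    QualCost    any_cost;

    cost_qual_eval(&or_cost, or_quals, root);
    cost_qual_eval(&any_cost, any_quals, root);

    return any_cost.startup + any_cost.per_tuple <
           or_cost.startup + or_cost.per_tuple;
}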

--
Regards,
Alena Rybakina
