Re: POC, WIP: OR-clause support for indexes

From: Alena Rybakina <lena(dot)ribackina(at)yandex(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: "Finnerty, Jim" <jfinnert(at)amazon(dot)com>, Marcos Pegoraro <marcos(at)f10(dot)com(dot)br>, Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>, Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org, teodor(at)sigaev(dot)ru, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Subject: Re: POC, WIP: OR-clause support for indexes
Date: 2023-08-09 11:33:31
Message-ID: fa5f57e1-4051-93d8-5f54-f1e2f4587cc2@yandex.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi! Thank you for your research, I'm sure it will help me to fix the
problem of calculating selectivity faster)
I'm sorry I didn't answer right away, to be honest, I had a full diary
of urgent matters at work. For this reason, I didn't have enough time to
work on this patch properly.

> The optimizer will itself do a limited form of "normalizing to CNF".
> Are you familiar with extract_restriction_or_clauses(), from
> orclauses.c? Comments above the function have an example of how this
> can work:
>
> * Although a join clause must reference multiple relations overall,
> * an OR of ANDs clause might contain sub-clauses that reference just one
> * relation and can be used to build a restriction clause for that rel.
> * For example consider
> * WHERE ((a.x = 42 AND b.y = 43) OR (a.x = 44 AND b.z = 45));
> * We can transform this into
> * WHERE ((a.x = 42 AND b.y = 43) OR (a.x = 44 AND b.z = 45))
> * AND (a.x = 42 OR a.x = 44)
> * AND (b.y = 43 OR b.z = 45);
> * which allows the latter clauses to be applied during the scans of a and b,
> * perhaps as index qualifications, and in any case reducing the number of
> * rows arriving at the join. In essence this is a partial transformation to
> * CNF (AND of ORs format). It is not complete, however, because we do not
> * unravel the original OR --- doing so would usually bloat the qualification
> * expression to little gain.
This is an interesting feature. I didn't notice this function before, I
studied many times consider_new_or_cause, which were called there. As
far as I know, there is a selectivity calculation going on there, but as
far as I remember, I called it earlier after my conversion, and
unfortunately it didn't solve my problem with calculating selectivity.
I'll reconsider it again, maybe I can find something I missed.
> Of course this immediately makes me wonder: shouldn't your patch be
> able to perform an additional transformation here? You know, by
> transforming "a.x = 42 OR a.x = 44" into "a IN (42, 44)"? Although I
> haven't checked for myself, I assume that this doesn't happen right
> now, since your patch currently performs all of its transformations
> during parsing.
>
> I also noticed that the same comment block goes on to say something
> about "clauselist_selectivity's inability to recognize redundant
> conditions". Perhaps that is relevant to the problems you were having
> with selectivity estimation, back when the code was in
> preprocess_qual_conditions() instead? I have no reason to believe that
> there should be any redundancy left behind by your transformation, so
> this is just one possibility to consider.
> Separately, the commit message of commit 25a9e54d2d says something
> about how the planner builds RestrictInfos, which seems
> possibly-relevant. That commit enhanced extended statistics for OR
> clauses, so the relevant paragraph describes a limitation of extended
> statistics with OR clauses specifically. I'm just guessing, but it
> still seems like it might be relevant to the problem you ran into with
> selectivity estimation. Another possibility to consider.

I understood what is said about AND clauses in this comment. It seems to
me that AND clauses saved like (BoolExpr *) expr->args->(RestrictInfo *)
clauseA->(RestrictInfo *)clauseB lists and OR clauses saved like
(BoolExpr *) expr -> orclause->(RestrictInfo *)clause A->(RestrictInfo
*)clause B.

As I understand it, selectivity is calculated for each expression. But
I'll exploring it deeper, because I think this place may contain the
answer to the question, what's wrong with selectivity calculation in my
patch.

> BTW, I sometimes use RR to help improve my understanding of the planner:
>
> https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Recording_Postgres_using_rr_Record_and_Replay_Framework
> The planner has particularly complicated control flow, which has
> unique challenges -- just knowing where to begin can be difficult
> (unlike most other areas). I find that setting watchpoints to see when
> and where the planner modifies state using RR is far more useful than
> it would be with regular GDB. Once I record a query, I find that I can
> "map out" what happens in the planner relatively easily.
Thank you for sharing this source! I didn't know about this before, and
it will definitely make my life easier to understand the optimizer.

I understand what you mean, and I researched the optimizer in a similar
way through gdb and looked at the comments and code in postgresql. This
is a complicated way and I didn't always understand correctly what this
variable was doing in this place, and this created some difficulties for me.

So, thank you for the link!

> Many interesting cases won't get SAOP transformation from the patch,
> simply because of the or_transform_limit GUC's default of 500. I don't
> think that that design makes too much sense. It made more sense back
> when the focus was on expression evaluation overhead. But that's only
> one of the benefits that we now expect from the patch, right? So it
> seems like something that should be revisited soon.
>
> I'm not suggesting that there is no need for some kind of limit. But
> it seems like a set of heuristics might be a better approach. Although
> I would like to get a better sense of the costs of the transformation
> to be able to say too much more.

Yes, this may be revised in the future after some transformations.
Initially, I was solving the problem described here [0]. So, after
testing [1], I come to the conclusion that 500 is the ideal value for
or_transform_limit.

[0]
https://www.postgresql.org/message-id/919bfbcb-f812-758d-d687-71f89f0d9a68%40postgrespro.ru

[1]
https://www.postgresql.org/message-id/6b97b517-f36a-f0c6-3b3a-0cf8cfba220c%40yandex.ru

--
Regards,
Alena Rybakina
Postgres Professional

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Christoph Berg 2023-08-09 11:38:07 Re: A failure in 031_recovery_conflict.pl on Debian/s390x
Previous Message Masahiro Ikeda 2023-08-09 11:10:42 Re: Support to define custom wait events for extensions