Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

From: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(at)paquier(dot)xyz>, Jim Nasby <jim(dot)nasby(at)openscg(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date: 2021-05-16 02:03:07
Message-ID: CAKU4AWp-O6goWhtVggUd1Lh=x90W5OMazUh0s71KkRC1u5dKJA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 18, 2019 at 8:09 AM Noah Misch <noah(at)leadboat(dot)com> wrote:

> On Tue, Oct 02, 2018 at 10:53:40AM -0400, Tom Lane wrote:
> > Noah Misch <noah(at)leadboat(dot)com> writes:
> > > On Mon, Oct 01, 2018 at 09:32:10PM -0400, Tom Lane wrote:
> > >> FWIW, my problem with this patch is that I remain unconvinced of the
> basic
> > >> correctness of the transform (specifically the unique-ification
> approach).
> > >> Noah's points would be important to address if we were moving the
> patch
> > >> towards commit, but I don't see much reason to put effort into it
> until
> > >> we can think of a way to prove whether that works.
> >
> > > Not even effort to fix the assertion failures I reported?
> >
> > If it seemed relevant to the proof-of-correctness problem, I would look
> > into it, but ...
>
I put some hours into theoretical study of the proof, and I didn't find any
> holes. When the planner removes "l0 LEFT JOIN r1", it does that because
> there's one output row per l0.ctid regardless of the rows of r1. Hence,
> deduplicating on (l0.ctid) is equivalent to deduplicating on (l0.ctid,
> r1.ctid). In the bad FULL JOIN case, (l0.ctid, r1.ctid) would be
> sufficient
> as a key, but we're out of luck because removing the join makes us have
> only
> l0.ctid for some tuples and only r1.ctid for other tuples.
>
> If PostgreSQL ever gets inner join removal, it would be harder to preserve
> this optimization in this form. At that point, perhaps we'd cost the path
> that retains the join for the benefit of $SUBJECT. Given the performance
> test
> results, $SUBJECT may already need a cost-based decision on whether to use
> it.
>
>
As for the proof-of-correctness problem, I think the below strategy should
be
easy to understand.

SELECT * FROM any_v WHERE (A OR B OR C).
=>
SELECT * FROM any_v WHERE A
UNION ALL
SELECT * FROM any_v WHERE B AND !A
UNION ALL
SELECT * FROM any_v WHERE C AND !A AND !(B AND !A);
where !(Expr) means (NOT (expr) OR (EXPR) IS NULL)
In this way, there is no duplicated data at the first. Oracle uses a similar
way like this[1].

- A shared planner info idea.

About the planning time case, suppose we have a query below.

SELECT * FROM t1, t2, ... t20 WHERE (t1.a = 1 or t2.a = 1) and xxx;

With the current strategy, t3 .. t20 would be planned many times, including
base
relation like (t3.. t20) and joinrel like (t{3, 4}, t(3,4,5}, t(3,4,5,6})
and
so on. I guess all these relations would get the same result at every
time. we
don't need to do that at every section of the Union ALL case. So an idea of
PlannerInfo A can access the planning result of PlannerInfo B comes, this
idea
doesn't come in my mind for the first time, many cost based
transformations may
need this.

Then I have the following POC in my mind.

PlannerInfo
{

PlannerInfo *shared_planner_info;
Relids or_relids; // this field can be improved later.
};

With this way, if we find a 'relids' is not related to or_relids. We can
grab
the planning result from shared_planner_info (in this or-union case, we can
set
that after we plan the first section of UNION).

Then I found this one would not work after planning time partition prune
since
the relid would change. For example:

P (P1, P2) PARTITION BY A
Q (Q1, Q2) PARTITION BY A

SELECT * FROM t, p, q where (t.a = 1 or p.a = 1). So in the un-transformed
case, Q2 would have relid = 7. After the transform, Q2 probably has relid
= 6.

So basically, I have 4 questions now.
1. Would investing on the shared_planner_info idea be a good idea?
2. Without planning time prune, shall the above design work?
3. Currently I want to use relid to identify a resultset and pathlists which
have the planning time prune issue, should we consider other methods?
4. Any other suggestion about to resolve the 'duplicated planning case'
besides
the shared_planner_info method?

[1]
https://blogs.oracle.com/optimizer/optimizer-transformations:-or-expansion

--
Best Regards
Andy Fan (https://www.aliyun.com/)

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-05-16 02:27:08 Re: PG 14 release notes, first draft
Previous Message Justin Pryzby 2021-05-15 23:33:58 Re: PG 14 release notes, first draft