Re: Improve OR conditions on joined columns (common star schema problem)

From: Jim Nasby <jim(dot)nasby(at)openscg(dot)com>
To: David Steele <david(at)pgmasters(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improve OR conditions on joined columns (common star schema problem)
Date: 2017-03-19 00:29:34
Message-ID: 3605fcb2-6622-6520-b678-83171fe4c9d8@openscg.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3/16/17 11:54 AM, David Steele wrote:
> On 2/14/17 4:03 PM, Tom Lane wrote:
>> Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com> writes:
>>> On 2/14/17 1:18 PM, Tom Lane wrote:
>>>> One point that could use further review is whether the de-duplication
>>>> algorithm is actually correct. I'm only about 95% convinced by the
>>>> argument I wrote in planunionor.c's header comment.
>>
>>> I'll put some thought into it and see if I can find any holes. Are you
>>> only worried about the removal of "useless" rels or is there more?
>>
>> Well, the key point is whether it's really OK to de-dup on the basis
>> of only the CTIDs that are not eliminated in any UNION arm. I was
>> feeling fairly good about that until I thought of the full-join-to-
>> left-join-to-no-join conversion issue mentioned in the comment.
>> Now I'm wondering if there are other holes; or maybe I'm wrong about
>> that one and it's not necessary to be afraid of full joins.
>
> This patch applies cleanly (with offsets) and compiles at cccbdde.
>
> Jim, have you had time to think about this? Any insights?

Having thought about it, I share Tom's concerns. And now I'm worried
about what happens if there are multiple separate OR clauses. I guess
those would be handled by separate UNIONs?

I'm also finding it a bit hard to follow the comment Tom mentioned. I'm
pretty sure I understand what's going on until this part:

> The identical proof can be expected to apply
> + * in other arms, except in an arm that references that rel in its version
> + * of the OR clause. But in such an arm, we have effectively added a
> + * restriction clause to what is known in other arms, which means that the
> + * set of rows output by that rel can't increase compared to other arms.

AIUI, this is describing a case something like this:

SELECT child.blah FROM child LEFT JOIN parent USING(parent_id)
WHERE child.foo AND (child.baz=1 or child.baz=2)

given that parent.parent_id is unique. Except for these concerns, there
would need to be a complex OR somewhere in here that sometimes
referenced parent and sometimes didn't, such as

WHERE child.foo AND (child.baz=1 OR parent.foo=3)

But I'm not following the logic here (very possibly because I'm wrong
about what I said above):

> + * Therefore the situation in such an arm must be that including the rel
> + * could result in either zero or one output row, rather than exactly one
> + * output row as in other arms. So we still don't need to consider it for
> + * de-duplication.

I'm definitely not certain about the rest of it.

Tom, could you expand the description some, especially with some examples?
--
Jim Nasby, Chief Data Architect, Austin TX
OpenSCG http://OpenSCG.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jim Nasby 2017-03-19 00:51:42 Re: PinBuffer() no longer makes use of strategy
Previous Message Tom Lane 2017-03-18 22:57:38 Re: Composite IS NULL again, this time with plpgsql