Re: Re: Improve OR conditions on joined columns (common star schema problem)

From: Jim Nasby <jim(dot)nasby(at)openscg(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: David Steele <david(at)pgmasters(dot)net>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: Improve OR conditions on joined columns (common star schema problem)
Date: 2017-03-20 02:54:53
Message-ID: 6cf3ba25-7dbc-20fe-4ff2-4f60a369a577@openscg.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3/19/17 2:32 PM, Tom Lane wrote:
> Jim Nasby <jim(dot)nasby(at)openscg(dot)com> writes:
>> Having thought about it, I share Tom's concerns. And now I'm worried
>> about what happens if there are multiple separate OR clauses. I guess
>> those would be handled by separate UNIONs?
>
> As proposed, the patch would try to optimize by splitting each OR clause
> independently, and would choose whichever way gave the best cost estimate.
> I'm not sure it's possible to do better than that, and even if it is,
> I think improving it could be left for later.

Agreed.

> I'd also considered an approach of de-duping on the basis of all relation
> ctids, while allowing a rel's ctid to be returned as NULL from a UNION arm
> in which the rel was eliminated entirely. But that doesn't fix it,
> because in this example the first arm would return (a.ctid, NULL) while
> the second arm would return (NULL, b.ctid), so that the UNION step would
> still fail to detect any duplication. To make it work, we'd have to not
> eliminate the joins, which would pretty much defeat the usefulness of
> the optimization for your original example case.

It might still be worth-while in some circumstances. In your example, if
there were these indexes:

a__id ON a(id), a__x ON a(x)
b__id ON b(id), b__y ON b(y)

then it might be faster to nested loop a__x=42 to b__id=a.id and union
that with b__y=43 nested to a__id=b.id.

That said, now isn't the time to be adding more complexity.

> So full joins definitely break this whole optimization. I think it's okay
> with left joins though, because when starting from "a left join b" it will
> never be possible to remove "a" so we'll always include a.ctid in the
> UNION de-duping step. If b is removed in some arm, then it must be true
> that we get exactly one left-join output row per a row, regardless of the
> contents of b, in that arm. The argument for the patch being okay is
> essentially that we must get exactly one left-join output row per a row,
> regardless of the contents of b, in *every* arm, because the various
> modified versions of the OR clause can't affect that conclusion. In some
> of the arms we might not remove b, and we might even be able to reduce the
> left join to an inner join, but there should still be no more than one
> join output row per a row. That being the case, it should be sufficient
> to de-dup using a.ctid while ignoring b.ctid.

The only case I can think of is: would it be possible (perhaps not
today; maybe in the future) for other parts of the query to affect join
elimination? I can't conceive of how that might happen, but if it did
then it's possible that the elimination would work differently with the
UNION than it would with an OR.

The comment on join_is_removable() does mention that there's other
potentially interesting cases that we can't handle now; it's maybe worth
mentioning

Other than that, I can't see any issues with your logic.

> Any clearer yet?

Absolutely. I think it would be very valuable to include that with the
initial comment in planunionor.c. Join reduction and removal is already
tricky enough to wrap your head around.

Other comments:

> + * is retty mechanical, but we can't do it until we have a RelOptInfo for the

s/retty/pretty/

I suspect that in many systems single-table queries are far more common
than CTEs, so maybe it's worth reversing those two tests in
split_join_or_clauses().

For the Unique path, it would be nice if the code did what would be
necessary to consider a TID hash join, since that means a user could
create the appropriate operator and it would just be picked up.
Certainly not worth much effort at this point though.

> + /*
> + * Must not have any volatile functions in FROM or WHERE (see notes at
> + * head of file).
> + */
> + if (contain_volatile_functions((Node *) parse->jointree))

Is there by chance anywhere else that needs to check that? Maybe worth
adding the info to the Query struct if so.

> + * We insist that all baserels used in the query be plain relations, so

Dumb question... views have already be decomposed at this point, right?

Perhaps less dumb question... what happens if the original query already
had setOps? AIUI setOps work has already been done by the time
split_join_or_clauses() happens; I just want to check that that's OK.

I'm not sure a GUC is worth it... I suspect that any query with multiple
rels and an OR condition is going to be expensive enough that whatever
additional plan time there is won't be noticeable.

I've verified that the patch still applies and make check-world is clean.
--
Jim Nasby, Chief Data Architect, Austin TX
OpenSCG http://OpenSCG.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-03-20 02:57:42 Re: Speed up Clog Access by increasing CLOG buffers
Previous Message Robert Haas 2017-03-20 02:51:41 Re: Partition-wise join for join between (declaratively) partitioned tables