Re: Re: Improve OR conditions on joined columns (common star schema problem)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jim Nasby <jim(dot)nasby(at)openscg(dot)com>
Cc: David Steele <david(at)pgmasters(dot)net>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: Improve OR conditions on joined columns (common star schema problem)
Date: 2017-03-19 19:32:57
Message-ID: 30756.1489951977@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Jim Nasby <jim(dot)nasby(at)openscg(dot)com> writes:
> On 3/16/17 11:54 AM, David Steele wrote:
>> On 2/14/17 4:03 PM, Tom Lane wrote:
>>> Well, the key point is whether it's really OK to de-dup on the basis
>>> of only the CTIDs that are not eliminated in any UNION arm. I was
>>> feeling fairly good about that until I thought of the full-join-to-
>>> left-join-to-no-join conversion issue mentioned in the comment.

>> Jim, have you had time to think about this? Any insights?

> Having thought about it, I share Tom's concerns. And now I'm worried
> about what happens if there are multiple separate OR clauses. I guess
> those would be handled by separate UNIONs?

As proposed, the patch would try to optimize by splitting each OR clause
independently, and would choose whichever way gave the best cost estimate.
I'm not sure it's possible to do better than that, and even if it is,
I think improving it could be left for later.

> Tom, could you expand the description some, especially with some examples?

Here's a counterexample --- admittedly rather artificial, but still a
counterexample --- to applying the optimization when there are FULL joins.
Consider

SELECT count(*)
FROM a FULL JOIN b ON (a.id = b.id)
WHERE (a.x = 42 OR b.y = 43);

and suppose that a and b have mutual FK constraints guaranteeing that
every non-null a.id value has exactly one match in b and vice versa. (For
the purposes of this example, I think it doesn't matter whether or not we
allow there to be null id values.) Also suppose that there are some join
rows satisfying both a.x = 42 and b.y = 43, while other join rows satisfy
only one of the OR arms. The patch would try to implement this query as,
approximately,

SELECT count(*)
FROM
( SELECT FROM a FULL JOIN b ON (a.id = b.id) WHERE a.x = 42
UNION-using-ctids
SELECT FROM a FULL JOIN b ON (a.id = b.id) WHERE b.y = 43 )

Now in the first UNION arm, we'd observe that the strict a.x = 42
condition allows reduction of the join strength to LEFT JOIN, and then
we'd deduce from the FK on b that the join to b can be dropped altogether.
In the second arm, we'd similarly conclude that a can be dropped
altogether, leaving

SELECT count(*)
FROM
( SELECT FROM a WHERE a.x = 42
UNION-using-ctids
SELECT FROM b WHERE b.y = 43 )

But at this point there are no rels that are not eliminated in any UNION
arm, so that no de-duping would occur in the UNION, meaning that we'd
double-count any join rows in which both a.x = 42 and b.y = 43.

I'd also considered an approach of de-duping on the basis of all relation
ctids, while allowing a rel's ctid to be returned as NULL from a UNION arm
in which the rel was eliminated entirely. But that doesn't fix it,
because in this example the first arm would return (a.ctid, NULL) while
the second arm would return (NULL, b.ctid), so that the UNION step would
still fail to detect any duplication. To make it work, we'd have to not
eliminate the joins, which would pretty much defeat the usefulness of
the optimization for your original example case.

So full joins definitely break this whole optimization. I think it's okay
with left joins though, because when starting from "a left join b" it will
never be possible to remove "a" so we'll always include a.ctid in the
UNION de-duping step. If b is removed in some arm, then it must be true
that we get exactly one left-join output row per a row, regardless of the
contents of b, in that arm. The argument for the patch being okay is
essentially that we must get exactly one left-join output row per a row,
regardless of the contents of b, in *every* arm, because the various
modified versions of the OR clause can't affect that conclusion. In some
of the arms we might not remove b, and we might even be able to reduce the
left join to an inner join, but there should still be no more than one
join output row per a row. That being the case, it should be sufficient
to de-dup using a.ctid while ignoring b.ctid.

Any clearer yet?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andreas Karlsson 2017-03-19 19:50:19 Re: Removing binaries (was: createlang/droplang deprecated)
Previous Message Michael Banck 2017-03-19 18:45:34 Re: Create replication slot in pg_basebackup if requested and not yet present