Optimization rules for semi and anti joins

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Optimization rules for semi and anti joins
Date: 2009-02-10 20:10:50
Message-ID: 23610.1234296650@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote (in response to Kevin Grittner's recent issues):
> Reflecting on this further, I suspect there are also some bugs in the
> planner's rules about when semi/antijoins can commute with other joins;

After doing some math I've concluded this is in fact the case. Anyone
want to check my work?

regards, tom lane

-----------------------------------

The basic outer-join identities used up through 8.3 are (as quoted from
optimizer/README):

1. (A leftjoin B on (Pab)) innerjoin C on (Pac)
= (A innerjoin C on (Pac)) leftjoin B on (Pab)

where Pac is a predicate referencing A and C, etc (in this case, clearly
Pac cannot reference B, or the transformation is nonsensical).

2. (A leftjoin B on (Pab)) leftjoin C on (Pac)
= (A leftjoin C on (Pac)) leftjoin B on (Pab)

3. (A leftjoin B on (Pab)) leftjoin C on (Pbc)
= A leftjoin (B leftjoin C on (Pbc)) on (Pab)

Identity 3 only holds if predicate Pbc must fail for all-null B rows
(that is, Pbc is strict for at least one column of B).

How do these relate to semi/antijoins?

Semijoin can be rearranged as freely as inner join
--------------------------------------------------

We have these properties:

S1. (A semijoin B on (Pab)) semijoin C on (Pac)
= (A semijoin C on (Pac)) semijoin B on (Pab)

The two semijoins amount to independent filter conditions for each A row,
so we can test them in either order.

S2. (A semijoin B on (Pab)) innerjoin C on (Pac)
= (A innerjoin C on (Pac)) semijoin B on (Pab)

This is only really safe if Pab and Pac are nonvolatile, since we might
evaluate them different numbers of times in the second form, but that is
true for innerjoin associativity as well. Currently, since we only create
semijoins from EXISTS subqueries with nonvolatile WHERE clauses, we know
a priori that Pab is nonvolatile; and we have never bothered to worry
about the volatility of innerjoin clauses, so I don't see that there's
a reason to be extra careful with Pac.

We also have to consider whether semijoin re-associates with outer joins
in the same ways as an inner join does.

S3. (A leftjoin B on (Pab)) semijoin C on (Pac)
= (A semijoin C on (Pac)) leftjoin B on (Pab)

This also works as long as the quals are nonvolatile.

S4. (A antijoin B on (Pab)) semijoin C on (Pac)
= (A semijoin C on (Pac)) antijoin B on (Pab)

Again, these are independent conditions on each A row.

Hence semijoins can be rearranged just as freely as inner joins.

Antijoin is a tad more strict than left join
--------------------------------------------

We have these properties:

A1. (A antijoin B on (Pab)) innerjoin C on (Pac)
= (A innerjoin C on (Pac)) antijoin B on (Pab)

True given nonvolatile quals.

A2. (A antijoin B on (Pab)) antijoin C on (Pac)
= (A antijoin C on (Pac)) antijoin B on (Pab)

Again, these are independent conditions on each A row.

A3? (A antijoin B on (Pab)) antijoin C on (Pbc)
= A antijoin (B antijoin C on (Pbc)) on (Pab)

This one unfortunately fails for antijoins. For example assume that
all three are one-column relations with equality join conditions, and
A = (1), (2)
B = (1)
C = (1)
The antijoin of A and B is (2,NULL), and antijoining that to C
gives (2,NULL,NULL). But the antijoin of B and C is empty, so
the second form produces output (1,NULL,NULL), (2,NULL,NULL).

We also have to consider mixed antijoin/leftjoin cases in identities
2 and 3.

A4. (A antijoin B on (Pab)) leftjoin C on (Pac)
= (A leftjoin C on (Pac)) antijoin B on (Pab)

True given nonvolatile quals.

A5. (A leftjoin B on (Pab)) antijoin C on (Pac)
= (A antijoin C on (Pac)) leftjoin B on (Pab)

This is just a restatement of A4.

A6. (A antijoin B on (Pab)) leftjoin C on (Pbc)
= A antijoin (B leftjoin C on (Pbc)) on (Pab)

The second form is in fact equivalent to null-extending the A/B antijoin
--- the actual contents of C cannot affect the result. So we could just
drop C altogether. (I'm not going to do anything about that now, but
it's something to consider for the planned join-elimination optimization.)
In the first form, if Pbc is strict on B then it must fail for all rows of
the antijoin result so we get the null-extended A/B result. If Pbc is not
strict then the first form might duplicate some rows in the antijoin
result, or produce non-null-extended rows. So in this case the identity
holds only if Pbc is strict, which is the same as for left joins.

A7? (A leftjoin B on (Pab)) antijoin C on (Pbc)
= A leftjoin (B antijoin C on (Pbc)) on (Pab)

This identity fails even if both qual clauses are strict (try it in
the same example as above).

So: identity 3 fails if C is an antijoin, but otherwise antijoins
associate like leftjoins.

The bottom line is that the current code isn't strict enough for antijoins
but is too strict for semijoins. Also, if we fix the second part of that,
we can still allow an EXISTS in the ON condition of an outer join to be
flattened --- it's just the NOT EXISTS case that poses a risk of
getting a worse plan from flattening.

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dimitri Fontaine 2009-02-10 20:41:46 Re: Optimization rules for semi and anti joins
Previous Message Jeff Davis 2009-02-10 20:06:45 advance local xmin more aggressively