From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|

To: | pgsql-hackers(at)postgreSQL(dot)org |

Subject: | Optimization rules for semi and anti joins |

Date: | 2009-02-10 20:10:50 |

Message-ID: | 23610.1234296650@sss.pgh.pa.us |

Views: | Raw Message | Whole Thread | Download mbox | Resend email |

Thread: | |

Lists: | pgsql-hackers |

I wrote (in response to Kevin Grittner's recent issues):

> Reflecting on this further, I suspect there are also some bugs in the

> planner's rules about when semi/antijoins can commute with other joins;

After doing some math I've concluded this is in fact the case. Anyone

want to check my work?

regards, tom lane

-----------------------------------

The basic outer-join identities used up through 8.3 are (as quoted from

optimizer/README):

1. (A leftjoin B on (Pab)) innerjoin C on (Pac)

= (A innerjoin C on (Pac)) leftjoin B on (Pab)

where Pac is a predicate referencing A and C, etc (in this case, clearly

Pac cannot reference B, or the transformation is nonsensical).

2. (A leftjoin B on (Pab)) leftjoin C on (Pac)

= (A leftjoin C on (Pac)) leftjoin B on (Pab)

3. (A leftjoin B on (Pab)) leftjoin C on (Pbc)

= A leftjoin (B leftjoin C on (Pbc)) on (Pab)

Identity 3 only holds if predicate Pbc must fail for all-null B rows

(that is, Pbc is strict for at least one column of B).

How do these relate to semi/antijoins?

Semijoin can be rearranged as freely as inner join

--------------------------------------------------

We have these properties:

S1. (A semijoin B on (Pab)) semijoin C on (Pac)

= (A semijoin C on (Pac)) semijoin B on (Pab)

The two semijoins amount to independent filter conditions for each A row,

so we can test them in either order.

S2. (A semijoin B on (Pab)) innerjoin C on (Pac)

= (A innerjoin C on (Pac)) semijoin B on (Pab)

This is only really safe if Pab and Pac are nonvolatile, since we might

evaluate them different numbers of times in the second form, but that is

true for innerjoin associativity as well. Currently, since we only create

semijoins from EXISTS subqueries with nonvolatile WHERE clauses, we know

a priori that Pab is nonvolatile; and we have never bothered to worry

about the volatility of innerjoin clauses, so I don't see that there's

a reason to be extra careful with Pac.

We also have to consider whether semijoin re-associates with outer joins

in the same ways as an inner join does.

S3. (A leftjoin B on (Pab)) semijoin C on (Pac)

= (A semijoin C on (Pac)) leftjoin B on (Pab)

This also works as long as the quals are nonvolatile.

S4. (A antijoin B on (Pab)) semijoin C on (Pac)

= (A semijoin C on (Pac)) antijoin B on (Pab)

Again, these are independent conditions on each A row.

Hence semijoins can be rearranged just as freely as inner joins.

Antijoin is a tad more strict than left join

--------------------------------------------

We have these properties:

A1. (A antijoin B on (Pab)) innerjoin C on (Pac)

= (A innerjoin C on (Pac)) antijoin B on (Pab)

True given nonvolatile quals.

A2. (A antijoin B on (Pab)) antijoin C on (Pac)

= (A antijoin C on (Pac)) antijoin B on (Pab)

Again, these are independent conditions on each A row.

A3? (A antijoin B on (Pab)) antijoin C on (Pbc)

= A antijoin (B antijoin C on (Pbc)) on (Pab)

This one unfortunately fails for antijoins. For example assume that

all three are one-column relations with equality join conditions, and

A = (1), (2)

B = (1)

C = (1)

The antijoin of A and B is (2,NULL), and antijoining that to C

gives (2,NULL,NULL). But the antijoin of B and C is empty, so

the second form produces output (1,NULL,NULL), (2,NULL,NULL).

We also have to consider mixed antijoin/leftjoin cases in identities

2 and 3.

A4. (A antijoin B on (Pab)) leftjoin C on (Pac)

= (A leftjoin C on (Pac)) antijoin B on (Pab)

True given nonvolatile quals.

A5. (A leftjoin B on (Pab)) antijoin C on (Pac)

= (A antijoin C on (Pac)) leftjoin B on (Pab)

This is just a restatement of A4.

A6. (A antijoin B on (Pab)) leftjoin C on (Pbc)

= A antijoin (B leftjoin C on (Pbc)) on (Pab)

The second form is in fact equivalent to null-extending the A/B antijoin

--- the actual contents of C cannot affect the result. So we could just

drop C altogether. (I'm not going to do anything about that now, but

it's something to consider for the planned join-elimination optimization.)

In the first form, if Pbc is strict on B then it must fail for all rows of

the antijoin result so we get the null-extended A/B result. If Pbc is not

strict then the first form might duplicate some rows in the antijoin

result, or produce non-null-extended rows. So in this case the identity

holds only if Pbc is strict, which is the same as for left joins.

A7? (A leftjoin B on (Pab)) antijoin C on (Pbc)

= A leftjoin (B antijoin C on (Pbc)) on (Pab)

This identity fails even if both qual clauses are strict (try it in

the same example as above).

So: identity 3 fails if C is an antijoin, but otherwise antijoins

associate like leftjoins.

The bottom line is that the current code isn't strict enough for antijoins

but is too strict for semijoins. Also, if we fix the second part of that,

we can still allow an EXISTS in the ON condition of an outer join to be

flattened --- it's just the NOT EXISTS case that poses a risk of

getting a worse plan from flattening.

- Re: Optimization rules for semi and anti joins at 2009-02-10 20:41:46 from Dimitri Fontaine
- Re: Optimization rules for semi and anti joins at 2009-02-10 21:51:01 from Robert Haas
- Re: Optimization rules for semi and anti joins at 2009-02-11 01:09:03 from Jonah H. Harris
- Re: Optimization rules for semi and anti joins at 2009-02-11 23:57:48 from Kevin Grittner
- Re: Optimization rules for semi and anti joins at 2009-02-20 15:59:03 from Tom Lane

From | Date | Subject | |
---|---|---|---|

Next Message | Dimitri Fontaine | 2009-02-10 20:41:46 | Re: Optimization rules for semi and anti joins |

Previous Message | Jeff Davis | 2009-02-10 20:06:45 | advance local xmin more aggressively |