Skip site navigation (1)
Skip section navigation (2)
## Optimization rules for semi and anti joins

### Responses

### pgsql-hackers by date

I wrote (in response to Kevin Grittner's recent issues): > Reflecting on this further, I suspect there are also some bugs in the > planner's rules about when semi/antijoins can commute with other joins; After doing some math I've concluded this is in fact the case. Anyone want to check my work? regards, tom lane ----------------------------------- The basic outer-join identities used up through 8.3 are (as quoted from optimizer/README): 1. (A leftjoin B on (Pab)) innerjoin C on (Pac) = (A innerjoin C on (Pac)) leftjoin B on (Pab) where Pac is a predicate referencing A and C, etc (in this case, clearly Pac cannot reference B, or the transformation is nonsensical). 2. (A leftjoin B on (Pab)) leftjoin C on (Pac) = (A leftjoin C on (Pac)) leftjoin B on (Pab) 3. (A leftjoin B on (Pab)) leftjoin C on (Pbc) = A leftjoin (B leftjoin C on (Pbc)) on (Pab) Identity 3 only holds if predicate Pbc must fail for all-null B rows (that is, Pbc is strict for at least one column of B). How do these relate to semi/antijoins? Semijoin can be rearranged as freely as inner join -------------------------------------------------- We have these properties: S1. (A semijoin B on (Pab)) semijoin C on (Pac) = (A semijoin C on (Pac)) semijoin B on (Pab) The two semijoins amount to independent filter conditions for each A row, so we can test them in either order. S2. (A semijoin B on (Pab)) innerjoin C on (Pac) = (A innerjoin C on (Pac)) semijoin B on (Pab) This is only really safe if Pab and Pac are nonvolatile, since we might evaluate them different numbers of times in the second form, but that is true for innerjoin associativity as well. Currently, since we only create semijoins from EXISTS subqueries with nonvolatile WHERE clauses, we know a priori that Pab is nonvolatile; and we have never bothered to worry about the volatility of innerjoin clauses, so I don't see that there's a reason to be extra careful with Pac. We also have to consider whether semijoin re-associates with outer joins in the same ways as an inner join does. S3. (A leftjoin B on (Pab)) semijoin C on (Pac) = (A semijoin C on (Pac)) leftjoin B on (Pab) This also works as long as the quals are nonvolatile. S4. (A antijoin B on (Pab)) semijoin C on (Pac) = (A semijoin C on (Pac)) antijoin B on (Pab) Again, these are independent conditions on each A row. Hence semijoins can be rearranged just as freely as inner joins. Antijoin is a tad more strict than left join -------------------------------------------- We have these properties: A1. (A antijoin B on (Pab)) innerjoin C on (Pac) = (A innerjoin C on (Pac)) antijoin B on (Pab) True given nonvolatile quals. A2. (A antijoin B on (Pab)) antijoin C on (Pac) = (A antijoin C on (Pac)) antijoin B on (Pab) Again, these are independent conditions on each A row. A3? (A antijoin B on (Pab)) antijoin C on (Pbc) = A antijoin (B antijoin C on (Pbc)) on (Pab) This one unfortunately fails for antijoins. For example assume that all three are one-column relations with equality join conditions, and A = (1), (2) B = (1) C = (1) The antijoin of A and B is (2,NULL), and antijoining that to C gives (2,NULL,NULL). But the antijoin of B and C is empty, so the second form produces output (1,NULL,NULL), (2,NULL,NULL). We also have to consider mixed antijoin/leftjoin cases in identities 2 and 3. A4. (A antijoin B on (Pab)) leftjoin C on (Pac) = (A leftjoin C on (Pac)) antijoin B on (Pab) True given nonvolatile quals. A5. (A leftjoin B on (Pab)) antijoin C on (Pac) = (A antijoin C on (Pac)) leftjoin B on (Pab) This is just a restatement of A4. A6. (A antijoin B on (Pab)) leftjoin C on (Pbc) = A antijoin (B leftjoin C on (Pbc)) on (Pab) The second form is in fact equivalent to null-extending the A/B antijoin --- the actual contents of C cannot affect the result. So we could just drop C altogether. (I'm not going to do anything about that now, but it's something to consider for the planned join-elimination optimization.) In the first form, if Pbc is strict on B then it must fail for all rows of the antijoin result so we get the null-extended A/B result. If Pbc is not strict then the first form might duplicate some rows in the antijoin result, or produce non-null-extended rows. So in this case the identity holds only if Pbc is strict, which is the same as for left joins. A7? (A leftjoin B on (Pab)) antijoin C on (Pbc) = A leftjoin (B antijoin C on (Pbc)) on (Pab) This identity fails even if both qual clauses are strict (try it in the same example as above). So: identity 3 fails if C is an antijoin, but otherwise antijoins associate like leftjoins. The bottom line is that the current code isn't strict enough for antijoins but is too strict for semijoins. Also, if we fix the second part of that, we can still allow an EXISTS in the ON condition of an outer join to be flattened --- it's just the NOT EXISTS case that poses a risk of getting a worse plan from flattening.

- Re: Optimization rules for semi and anti joins at 2009-02-10 20:41:46 from Dimitri Fontaine
- Re: Optimization rules for semi and anti joins at 2009-02-10 21:51:01 from Robert Haas
- Re: Optimization rules for semi and anti joins at 2009-02-11 01:09:03 from Jonah H. Harris
- Re: Optimization rules for semi and anti joins at 2009-02-11 23:57:48 from Kevin Grittner
- Re: Optimization rules for semi and anti joins at 2009-02-20 15:59:03 from Tom Lane

Next: From:Dimitri FontaineDate:2009-02-10 20:41:46Subject: Re: Optimization rules for semi and anti joinsPrevious: From: Jeff DavisDate: 2009-02-10 20:06:45Subject: advance local xmin more aggressively