Re: IN vs EXISTS equivalence

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: IN vs EXISTS equivalence
Date: 2008-08-11 16:59:27
Message-ID: 6531.1218473967@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I believe that the optimizable cases for EXISTS are those where the
>> EXISTS() is either at the top level of WHERE, or just underneath a
>> NOT,

> The rest of the plan makes sense to me, but this part seems narrow.
> There's probably a good reason for that which is beyond my depth, but
> attached is a view that is used for calculating statistics for a
> database which is primarily used for "case management" purposes. If
> EXISTS could also be optimized in the contexts used there, it would be
> great.

I chewed on that for awhile. We can certainly optimize EXISTS that's
appearing in the ON-condition of a regular JOIN, because that's not
really semantically different from a WHERE-condition. But I don't think
it's going to be reasonable to improve EXISTS in outer-JOIN ON
conditions. There are a couple of problems. Consider

t1 LEFT JOIN t2
ON (t1.f1 = t2.f2 AND
EXISTS(SELECT 1 FROM t3 WHERE t3.f3 = t1.fx AND t3.f4 = t2.fy))

To implement this with the correct semantics, we'd have to have the
t1/t2 join and the t1/t2/t3 join going on in the same execution node,
with two different join behaviors (LEFT and SEMI). There isn't any
way AFAICS to factor it into two separate steps. That's unreasonably
complicated, and it's not clear that you'd get any better performance
anyway than the current implementation (which'd treat the EXISTS as
a subplan).

Even worse, let the EXISTS be a degenerate case:

t1 LEFT JOIN t2
ON (t1.f1 = t2.f2 AND
EXISTS(SELECT 1 FROM t3 WHERE t3.f3 = t1.fx));

We can't actually treat this EXISTS as a semijoin between t1 and t3,
either before or after the LEFT JOIN; because then the behavior would be
to drop t1 rows that have no t3 match, which is not what this query
specifies.

(Note: the other degenerate case, where the EXISTS depends only on
t2, *could* be optimized since we could just let the semijoin be
performed within the RHS of the LEFT JOIN.)

So this is not something I'm going to tackle; at least not this
devel cycle.

One small step we can take in this direction, though, is to improve the
planner's internal handling of the qual conditions for IN and EXISTS.
Right now the process is just to throw the sub-select into the main
range table and put the IN join conditions into the same place in WHERE
that the IN-clause was to start with. The trouble with this is that the
distribute_quals_to_rels processing has no idea that there's anything
special about the IN join conditions. We got away with that for the
limited case of IN clauses at the top level of WHERE, but it's become
clear to me over the weekend that this has no hope of working for NOT
EXISTS --- since that's effectively an outer join, it absolutely has to
have the same kinds of qual-scheduling constraints as ordinary outer
joins do. So we need a data structure that distribute_quals_to_rels can
work with. What I think needs to happen is that the initial pass that
pulls up optimizable IN/EXISTS sub-selects should not merge the
SubLink's replacement qual clauses seamlessly, but put them underneath a
new node type, say "FlattenedSubLink", that retains knowledge of the
join it's representing. The FlattenedSubLink would survive only as far
as distribute_quals_to_rels, which would distribute out the contained
qual conditions instead of the FlattenedSubLink itself --- but only
after marking them properly for the outer-join restrictions. This
representation would make it feasible to work with IN/EXISTS that are
inside JOIN ON conditions, which the present representation using a
single in_info_list really can't do. The semantic issues are still
there but at least the representation isn't getting in the way ...

regards, tom lane

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message David E. Wheeler 2008-08-11 17:43:36 Re: Type Categories for User-Defined Types
Previous Message Decibel! 2008-08-11 16:47:10 Re: Mini improvement: statement_cost_limit