Re: Optimization rules for semi and anti joins

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Optimization rules for semi and anti joins
Date: 2009-02-12 13:06:38
Message-ID: 4993CA7E.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>>> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
>> (A semijoin B on (Pab)) antijoin C on (Pbc)
>> = A semijoin (B antijoin C on (Pbc)) on (Pab)
>
>> I think this one is true, and it doesn't seem to be mentioned,
>> unless I'm missing something. It seems potentially useful.
>
> Hmm, it doesn't seem terribly well-defined --- the values of B are
> indeterminate above the semijoin in the first case, so having Pbc
> refer to them doesn't seem like a good idea. In particular, it
> seems like in the first case the semijoin could randomly choose a B
> row that has a join partner in C, causing the A row to disappear
> from the result, when the same A row has another B partner that does
> not join to C --- and the second form would find that B partner and
> allow the A row to be output.

I was looking at it from the abstraction that A semijoin B could be
treated as the equivalent of an inner join with duplicate A rows from
the result removed before the final result of the enclosing query. It
seems you've been interpreting it as meaning the inner join of A to
the first (arbitrarily chosen) row of B found. It appears that these
two views of it generate the same results for the other identities,
but not this one.

The first case here could be implemented as an inner join which
included (in addition to any columns needed for other purposes) a row
identifier for A and all the columns of B which are needed for the Pbc
predicate. The antijoin could be performed on that result, after
which duplicate A rows would be eliminated, as well as the row
identifier and the B columns. A simplified (and only slightly
artificial) example of where this could buy orders of magnitude
improvement in run time follows.

Imagine that A is a Party table with ten million rows. Imagine that B
and C are sets of rows within a hundred million row CaseHist table
which records events on cases, some of which are associated with
parties. B represents warrant issuing events, each of which is
related to a party. C represents warrant disposition events, each of
which is related to a warrant issuing event. Both tables are indexed
on case number and a sequence number. Party has a name index.

You've got a name, and you want a list of outstanding warrants for
parties with a matching name.

The second case above would be the natural way to write the query.
The first case, implemented as I describe, would be orders of
magnitude faster.

-Kevin

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message BogDan Vatra 2009-02-12 13:16:21 Re: SE-PostgreSQL and row level security
Previous Message Magnus Hagander 2009-02-12 12:54:13 Re: mingw check hung