Re: Performance improvement for joins where outer side is unique

From: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Performance improvement for joins where outer side is unique
Date: 2016-03-12 20:01:11
Message-ID: CAKFQuwYVKsyhWNZaWpc=4YOs6qBnJE-pRcN9CWh7ExqZyvNNpg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Saturday, March 12, 2016, David G. Johnston <david(dot)g(dot)johnston(at)gmail(dot)com>
wrote:

> On Saturday, March 12, 2016, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us
> <javascript:_e(%7B%7D,'cvml','tgl(at)sss(dot)pgh(dot)pa(dot)us');>> wrote:
>
>> "David G. Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com> writes:
>> > Don't the semantics of a SEMI JOIN also state that the output columns
>> only
>> > come from the outer relation? i.e., the inner relation doesn't
>> contribute
>> > either rows or columns to the final result? Or is that simply
>> > an implementation artifact of the fact that the only current way to
>> perform
>> > a semi-join explicitly is via exists/in?
>>
>> I think it's an artifact. What nodes.h actually says about it is you get
>> the values of one randomly-selected matching inner row, which seems like
>> a fine definition for the purposes we plan to put it to.
>>
>>
> But is it a definition that actually materializes anywhere presently?
>
> I'm not sure what we consider an authoritative source but relational
> algebra does define the results of semi and anti joins as only containing
> rows from main relation.
>
> https://en.m.wikipedia.org/wiki/Relational_algebra
>

Pondering it more calling these optimizations "semi" joins further
distances us from the meaning of "semi" as used in relational algebra. The
half that semi refers to IS that only one half of the tables are returned.
That you only get a single row of output regardless of multiple potential
matches is simply a consequence of this and general set theory.

In short "semi" communicates a semantic meaning as to the intended output
of the query irrespective of the data upon which it is executed. We now
are hijacking the and calling something "semi" if by some chance the data
the query is operating against happens to be accommodating to some
particular optimization.

This seems wrong on definitional and cleanliness grounds.

So while I'm still liking the idea of introducing specializations of outer
and inner joins I think calling them "semi" joins adds a definitional
inconsistency we are better off avoiding.

This came about because calling something "outer semi join" struck me as
odd.

Something like "outer only join" and "inner only join" comes to mind.
Consider the parallel between this and "index only scan". Learning that
"only" means "join the outer row to the (at most for outer) one and only
row in the inner relation" doesn't seem to much of a challenge.

David J.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2016-03-12 20:38:17 Re: amcheck (B-Tree integrity checking tool)
Previous Message Andres Freund 2016-03-12 19:46:25 Re: [COMMITTERS] pgsql: Only try to push down foreign joins if the user mapping OIDs mat