Re: improving a badly optimized query

From: Brandon Craig Rhodes <brandon(at)oit(dot)gatech(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: improving a badly optimized query
Date: 2002-11-22 15:51:12
Message-ID: w67kf5eh73.fsf@guinness.ts.gatech.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Brandon Craig Rhodes <brandon(at)oit(dot)gatech(dot)edu> writes:
> > (a) (slow)
> > SELECT * FROM role_keys NATURAL LEFT JOIN role_person
> > WHERE person = 28389;
>
> > (b) (fast)
> > SELECT * FROM role_keys NATURAL JOIN role_person
> > WHERE person = 28389;
>
> > ... the rows created for unmatched role_keys rows by the LEFT JOIN
> > are guaranteed to be thrown out by the WHERE clause ... [so] the
> > LEFT JOIN could be reduced to a JOIN
>
> It seems like a useful optimization, but I have an uncomfortable
> feeling that there's something wrong with it. Can you point to a
> rigorous proof that this is okay in complicated contexts such as
> nested outer joins?

We can optimize the above query simply by observing that the result of
a LEFT JOIN includes both the rows that would have been produced by a
simple JOIN, and those rows of the left table that did not match any
from the right. That is:

SELECT * FROM role_keys NATURAL LEFT JOIN role_person WHERE person = 28389;

is the equivalent of:

SELECT * FROM role_keys NATURAL JOIN role_person WHERE person = 28389
UNION
SELECT * FROM
(SELECT role, CAST(NULL AS integer) AS person FROM role_keys) AS rp
WHERE person = 28389
EXCEPT
SELECT role, CAST(NULL AS integer) AS person FROM role_person;

Since the parenthesized sub-select generates only NULL person fields,
the WHERE condition will yield an empty result, so the entire query
can be reduced to:

SELECT * FROM role_keys NATURAL JOIN role_person WHERE person = 28389;

Similar technique will yield transforms for RIGHT and OUTER JOINs that
can reveal similar optimizations.

I should state our original intention:

We wanted to gain simplicity and save space by inserting rows into
role_person only for those roles associated with people, leaving all
other roles unmentioned; but obviously a view that contains all roles
and mentions their people requires a LEFT JOIN, which kills query
performance. There are three alternatives:

(1) Mention every role in the role_person table, using NULLs for
those roles without associated people, and then our view can be
implemented with a simple JOIN rather than an expensive LEFT JOIN.
(This wastes space and we find it logically unattractive, as well as
more difficult to maintain.)

(2) Build queries on top of the actual tables, instead of the view,
choosing for each query the least sufficient JOIN. (This destroys
the usefulness of views, and makes our queries more complex since we
must hand-craft each JOIN to fit the query.)

(3) Include the above optimization into at least our own copy of
PostgreSQL. (This will be our first experience dealing directly
with the query optimizer, and we are not sure where to begin.)

Please advise,
--
Brandon Craig Rhodes http://www.rhodesmill.org/brandon
Georgia Tech brandon(at)oit(dot)gatech(dot)edu

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message scott.marlowe 2002-11-22 16:06:50 Re: Changing the type of a column in an already populated
Previous Message Stephan Szabo 2002-11-22 15:48:39 Re: Lack of use of indexes