Re: query optimization scenarios 17,701 times faster!!!

From: "Robert Dyas" <rdyas(at)adelphia(dot)net>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: query optimization scenarios 17,701 times faster!!!
Date: 2003-04-24 21:08:58
Message-ID: MGEFJOBFIEAIADIKAMEKEEJMCIAA.rdyas@adelphia.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Actually, what I'm suggesting first is much more straight forward. Each
operation must take place on a row set. When you've got a bunch of tables
that are joined together, and columns with unique indexes are specified in
the where clause to be equal to some simple value (regalrdless of which
table), then before you do any other processing, make sure you only process
on that single row! Don't join a million rows together only to throw them
all out but one!! Which it appears is exactly what 7.3.1 was doing. It
should be so easy to scan a where clause for columns that contain unique
indexes and are equal to a fixed specified value. If so, in all cases, use
that to limit rows before you do any other processing. That is the simple
case, but that alone will have a major impact on a good percentage of real
world queries. It is a simple optimization and will work in all cases, no
exceptions. If you join tables before ever considering the where clause
restrictions, you've got to be doing a ton of extra work in many, many
cases.

If the above makes sense, then we can talk about the next most general case
(i.e. determining the likely number of rows that will be retunred by a where
clause restriction on a column). But the concept is still the same - don't
spend time joining rows when you are going to later throw them away becuase
of some where clause restriction. Stated another way, "restrict then join"
is far more efficient than "join then restrict".

-----Original Message-----
From: Stephan Szabo [mailto:sszabo(at)megazone23(dot)bigpanda(dot)com]
Sent: Thursday, April 24, 2003 4:39 PM
To: Robert Dyas
Subject: RE: [HACKERS] query optimization scenarios 17,701 times
faster!!!

On Thu, 24 Apr 2003, Robert Dyas wrote:

> Actually, I wish I would have sent along only the last pair of queries to
> focus the conversation more. So I will focus on the last pair of queries
for
> this comment.

Okay, that one did seem wierder.

> In that case, the first table specified in the from clause is the table
from
> which all of the columns are retrieved. Furthermore, the where clause
> indicates that the table's primary key must equal a single value. This can
> only return one row (or zero rows). So all subsequent processing should
only
> occur on that one row. That is the simple case involving unique indexes,
and
> the easiest to optimize.

Right, but I was pointing out that the particular two queries you gave
aren't necessarily the same without that distinct on the query. The fact
that it gives you the same results depends on that you know that only one
(or zero because its outer) rows in the right side will match or that
you're doing something like distinct, distinct on, group by afaics.
Otherwise I think you'd get duplicated rows in the first version and not
in the second.

> The more general case would be an estimate (based on index statistics) of
> the number of rows that would be returned by a column equaling some value.
> Those where the estimated number of hits is sufficiently small relative to
> total table size would be processed first (in the case of multiple
> restrictions, those resulting in the least number of estimated result rows
> would be processed first). I would assume this is one of the first
> optimizations somebody would address when writing a query optimizer, yet
it
> appears to be absent or broken in 7.3.1.

The problem is that explicit join syntax is also overloaded right now to
mean do these joins in this order which does inhibit some optimizations.
I think there's been talk of making this optional at some point.
I believe there were cases with outer joins where (A LJ B) LJ C is not the
same as A LJ (B LJ C) as well (admittedly they're really wierd using
things like is not null or coalesce in the on conditions) and so you have
to be careful when reordering them (if that's part of what you're
suggesting).

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message scott.marlowe 2003-04-24 21:18:42 Re: query optimization scenarios 17,701 times faster!!!
Previous Message Stephan Szabo 2003-04-24 19:17:09 Re: query optimization scenarios 17,701 times faster!!!