Re: Planner creating ineffective plans on LEFT OUTER joins

From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Andres Freund" <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Planner creating ineffective plans on LEFT OUTER joins
Date: 2008-06-26 16:36:15
Message-ID: 603c8f070806260936s905b2cfp48ec1e620a084937@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> IMHO we should have a single parameter which indicates how much planning
> time we consider acceptable for this query. e.g.
>
> optimization_level = 2 (default), varies 1-3
>
> Most automatic optimisation systems allow this kind of setting, whether
> it be a DBMS, or compilers (e.g. gcc).

It's my understanding that the philosophy of the PGDG in the past has
been to avoid putting any kind of hints into the system, focusing
rather an improving the planning of queries. A quick Google search
turns up, for example:

http://archives.postgresql.org/pgsql-performance/2003-12/msg00181.php

Now, perhaps the thinking on this has changed, but a global knob like
this strikes me as a bad idea. If Tom is right that improving the
plan on queries like this would result in an exponential increase in
planning time, then it's certainly important not to paint with too
broad a brush. It would really be best to be able to tell the planner
which specific part of the query may be susceptible to this type of
optimization, because you could easily have many places in a
complicated query that would need to be analyzed, and if the planning
time is going to be a problem then we don't want to overplan the
entire query just to fix the problem in one particular spot. And we
certainly don't want to do a whole bunch of other, unrelated,
expensive optimizations at the same time.

If one were to add a hint, I think the hint should tell the planner:
Hey, see this left join? Well, computing the right-hand side of this
thing is going to take forever unless we get some information to help
us out. So please do all of your limit and filter operations on the
left-hand side first, and then if you have any rows left, then
evaluate the right-hand side for just the values that matter. i.e. in
the example query:

SELECT * FROM ab LEFT OUTER JOIN (bc JOIN cd ON bc.c = cd.d) ON ab.b =
bc.b WHERE ab.a = 20000

...please look up the rows in ab where ab.a = 20000. If you find any,
then make a hash table of all the values you find for b among those
rows. Then when you evaluate (bc JOIN cd ON bc.c = cd.d) you can
filter bc for rows where bc.b is in the hash table.

This might not be a good query plan in the average case, but there are
definitely instances where you might want to force this behavior. In
fact, even if you had to do it as a nested loop (re-evaluating the bc
JOIN cd clause for each possible value of b) there are still cases
where it would be a big win. Of course the nicest thing would be for
the planner to realize on its own that the right-hand side of the join
is going to generate a gazillion rows and the left-hand side is going
to generate one, but maybe (as Tom and the OP suggested) that is
expecting too much (though I confess I don't quite see why).

...Robert

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David E. Wheeler 2008-06-26 16:38:21 Re: Latest on CITEXT 2.0
Previous Message Josh Berkus 2008-06-26 16:29:44 Re: [0/4] Proposal of SE-PostgreSQL patches