Re: add_path optimization

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: add_path optimization
Date: 2009-03-01 04:22:45
Message-ID: 603c8f070902282022w7702aebdv151b1bd14dddd744@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Feb 27, 2009 at 11:06 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> One other thought to roll around in your head: at the time that the
> current add_path logic was designed, compare_pathkeys was ungodly
> expensive, which is why the code tries to compare costs first.
> We've since introduced the "canonical pathkey" representation, which
> makes compare_pathkeys enough cheaper that it might not be insane
> to do it in the other order.  I don't immediately see an easy win
> there, because as things are structured we typically want the cost
> comparison result anyway for list sorting purposes.  But maybe there's
> a way around that.

I think the best think to do with the pathkeys comparison is avoid it
altogether as frequently as possible. One thing I've been thinking
about is trying to refactor add_paths_to_joinrel() so that it doesn't
need to be called twice for every pair of input relations. This saves
some cycles in a few different places: you avoid recomputing the merge
clauses and hash clauses twice, and because you now generate all of
the unsorted hash-paths in the same piece of code, you can do an
add_similar_paths()-type thing where you only compare costs first and
then throw just the winners into the main list. I think there are
some minor economies in sort_inner_and_outer() as well. I haven't
been able to convince myself that all of this adds up to a measurable
win, though.

The problem is sort_inner_and_outer and hash_inner_and_outer actually
generate only a relatively small number of paths. The latter
generates at most two, and the former generates
length(mergeclause_list), which at least for the kinds of queries that
I usually write is most often one. Maybe two? On the other hand,
match_unsorted_outer generates up to 5 nestloops plus potentially
several merge joins PER OUTER PATH. That's a much bigger number.
Ergo, if you want to make join planning fast, you need to make
match_unsorted_outer() consider fewer paths or consider them more
quickly.

And, unfortunately, it's not clear how to do that. I have a nagging
feeling that the logic around when to try materializing the inner path
could be improved. We skip it when the cheapest total inner path is
an index scan, bitmap heap scan, tid scan, material path, function
scan, cte scan, or work table scan (which is an irritatingly long
list... is there a better way to do this?), but cost_nestloop() only
costs the thing differently if the inner side is a material path or
hash path. I sort of wonder if cost_nestloop() could be made to decide
when materialization of the inner path is warranted. That might help
some, but I still think the fact that match_unsorted_outer() generates
a number of paths that is linear in the size of the outer pathlist
(rather than a constant) is going to tend to make it the most
expensive part of join planning.

...Robert

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Hiroshi Saito 2009-03-01 07:09:23 Re: regression test crashes at tsearch
Previous Message Andrew Dunstan 2009-03-01 00:52:05 cardinality()