Re: upper planner path-ification

From: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: upper planner path-ification
Date: 2015-06-23 01:55:20
Message-ID: 9A28C8860F777E439AA12E8AEA7694F8011090B3@BPXM15GP.gisp.nec.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org
> [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Robert Haas
> Sent: Thursday, May 14, 2015 10:39 AM
> To: pgsql-hackers(at)postgresql(dot)org; Tom Lane
> Subject: [HACKERS] upper planner path-ification
>
> Hi,
>
> I've been pulling over Tom's occasional remarks about redoing
> grouping_planner - and maybe further layers of the planner - to work
> with Paths instead of Plans. I've had difficulty locating all of the
> relevant threads. Here's everything I've found so far, which I'm
> pretty sure is not everything:
>
> http://www.postgresql.org/message-id/17400.1311716272@sss.pgh.pa.us
> http://www.postgresql.org/message-id/2614.1375730493@sss.pgh.pa.us
> http://www.postgresql.org/message-id/22721.1385048662@sss.pgh.pa.us
> http://www.postgresql.org/message-id/BANLkTinDjjFHNOzESG2J2U4GOkqLu69Zqg@mai
> l.gmail.com
> http://www.postgresql.org/message-id/8479.1418420018@sss.pgh.pa.us
>
> I think there are two separate problems here. First, there's the
> problem that grouping_planner() is complicated. It's doing cost
> comparisons, but all in ad-hoc fashion rather than using a consistent
> mechanic the way add_path() does. Generally, we can plan an aggregate
> using either (1) a hashed aggregate, (2) a sorted aggregate, or (3)
> for min or max, an index scan that just grabs the highest or lowest
> value in lieu of a full table scan. Instead of generating a plan for
> each of these possibilities, we'd like to generate paths for each one,
> and then pick one to turn into a plan. AIUI, the hope is that this
> would simplify the cost calculations, and also make it easier to
> inject other paths, such as a path where an FDW performs the
> aggregation step on the remote server.
>
> Second, there's the problem that we might like to order aggregates
> with respect to joins. If we have something like SELECT DISTINCT ON
> (foo.x) foo.x, foo.y, bar.y FROM foo, bar WHERE foo.x = bar.x, then
> (a) if foo.x has many duplicates, it will be better to DISTINCT-ify
> foo and then join to bar afterwards but (b) if foo.x = bar.x is highly
> selective, it will be better to join to bar first and then
> DISTINCT-ify the result. Currently, aggregation is always last;
> that's not great. Hitoshi Harada's proposed strategy of essentially
> figuring out where the aggregation steps can go and then re-planning
> for each one is also not great, because each join problem will be a
> subtree of the one we use for the aggregate-last strategy, and thus
> we're wasting effort by planning the same subtrees multiple times.
> Instead, we might imagine letting grouping planner fish out the best
> paths for the joinrels that represent possible aggregation points,
> generate aggregation paths for each of those, and then work out what
> additional rels need to be joined afterwards. That sounds hard, but
> not as hard as doing something sensible with what we have today.
>
Once we support to add aggregation path during path consideration,
we need to pay attention morphing of the final target-list according
to the intermediate path combination, tentatively chosen.
For example, if partial-aggregation makes sense from cost perspective;
like SUM(NRows) of partial COUNT(*) AS NRows instead of COUNT(*) on
billion rows, planner also has to adjust the final target-list according
to the interim paths. In this case, final output shall be SUM(), instead
of COUNT().

I think, this planner enhancement needs a mechanism to inform planner
expected final output for each Path. It means, individual Path node
can have a replacement rule: which target-entry should have what
expression instead, once this path is applied.
In above case, PartialAggPath will have an indication; SUM(nrows)
shall be applied on the location where "COUNT(*)" is originally
placed.

In case of relations join that involves paths with this replacement
rules, it may be a bit complicated if an expression takes argument
come from both relations; like CORR(inner_rel.X, outer_rel.Y),
I'm not sure whether we can have more broken-down partial aggregate.
However, relations on either of inner/outer side are responsible to
the columns in inner/outer side. Unless all the arguments in aggregate
function come from same side, it should not block aggregate push-
down as long as estimated cost makes sense.

That is a probably makes a problem when we allow to add aggregate
path as a part of partial plan tree.

Thanks,

> I'm inclined to think that it would be useful to solve the first
> problem even if we didn't solve the second one right away (but that
> might be wrong). As a preparatory step, I'm thinking it would be
> sensible to split grouping_planner() into an outer function that would
> handle the addition of Limit and LockRows nodes and maybe planning of
> set operations, and an inner function that would handle GROUP BY,
> DISTINCT, and possibly window function planning.
>
> Thoughts?
>
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2015-06-23 03:45:08 Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
Previous Message Tom Lane 2015-06-23 01:47:42 Re: less log level for success dynamic background workers for 9.5