Re: Combining Aggregates

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
Cc: "hlinnaka(at)iki(dot)fi" <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Amit Kapila <amit(dot)kapila(at)enterprisedb(dot)com>
Subject: Re: Combining Aggregates
Date: 2015-07-27 03:16:18
Message-ID: CAKJS1f8wkD0mK4jdg5bHcTvFCUhGj76_VkNaEhNJVkqkg_FhVQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 27 July 2015 at 12:14, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com> wrote:

> > The main use case people have been talking about is parallel query, but
> > is there some other case this would be useful right now, without the
> > parallel query feature? You and Simon talked about this case:
> >
> > > 2. Queries such as:
> > >
> > > SELECT p.name, SUM(s.qty) FROM sales s INNER JOIN product p ON
> s.product_id
> > > = p.product_id GROUP BY p.name;
> > >
> > > Such a query could be transformed into:
> > >
> > > SELECT p.name,SUM(qty) FROM (SELECT product_id,SUM(qty) AS qty FROM
> sales
> > > GROUP BY product_id) s
> > > INNER JOIN product p ON p.product_id = s.product_id GROUP BY p_name;
> > >
> > > Of course the outer query's SUM and GROUP BY would not be required if
> there
> > > happened to be a UNIQUE index on product(name), but assuming there's
> not
> > > then the above should produce the results faster. This of course works
> ok
> > > for SUM(), but for something like AVG() or STDDEV() the combine/merge
> > > aggregate functions would be required to process those intermediate
> > > aggregate results that were produced by the sub-query.
> >
> > Any chance you could implement that in the planner?
> >
>
It likely needs planner enhancement prior to other applications...
>
> http://www.postgresql.org/message-id/CA+TgmobgWKHfZc09B+s2LxJTwoRD5Ht-avoVDvaQ4+RpwrO4bg@mail.gmail.com
>
>
I had thought this too, but I'm not sure that's 100% correct. As I just
said in the my previous email on this thread, I am now working on "group by
before join". I had started it with the intentions of path-ifying the
grouping planner, but I soon realised that the grouping_planner() does
quite a bit more at that final stage that I propose to do to allow "group
by before join". This is mostly around handling of DISTINCT, HAVING and
LIMIT. I don't think those need to be handled in my patch, perhaps with the
exception of DISTINCT without GROUP BY, but not when both are present.

Instead I've started by inventing GroupingPath and I'm now building these
new path types once there's enough tables joined for all Vars of the GROUP
BY and agg parameters.

I believe this optimisation needs to be costed as pushing the GROUP BY down
to happen as early as possible is *not* always a win. Perhaps the JOIN is
very selective and eliminates many groups. Hence I've added costing and
allowed the planner to choose what it thinks is cheapest.

> Once planner allowed to have both of normal path and partial aggregation
> paths to compare according to the cost, it is the straightforward way to
> do.
>

I've ended up with 2 boolean members to struct GroupingPath, combine_states
and finalize_aggs. I plan to modify nodeAgg.c to use these, and EXPLAIN to
give a better description of what its doing.

>
> Here are various academic research, for example, below is the good starting
> point to clarify aggregate queries that we can run with 2-phase approach.
>
> http://www.researchgate.net/publication/2715288_Performing_Group-By_before_Join
>
>
Thanks, I've been using that very paper.

Regards

David Rowley

--
David Rowley http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2015-07-27 03:43:28 Re: security labels on databases are bad for dump & restore
Previous Message David Rowley 2015-07-27 02:59:32 Re: Combining Aggregates