Re: Aggregate Supporting Functions

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Aggregate Supporting Functions
Date: 2015-06-10 02:32:14
Message-ID: CAKJS1f9EQ64EJHOy81X38_K8u_XE7209NiuRwMU-Uh-FYO_pAw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 10 June 2015 at 01:53, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:

> David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
>
> > 3. Add logic in the planner to look for look for supporting
> > cases. With logic something along the lines of:
> >
> > a. Does the query have any aggregates? If not -> return;
> > b. Does the query have more than 1 aggregate? If not -> return;
> > c. Does the at least one of the aggregates have hassuppagg set
> > to true? if not -> return;
> > d. Analyze aggregates to eliminate aggregates that are covered
> > by another aggregate. We should use the aggregate which
> > eliminates the most other aggregates*
> >
> > * For example stddev(x) will support avg(x), sum(x) and count(x)
> > so a query such as select stddev(x), avg(x), sum(x), count(x)
> > will eliminate avg(x), sum(x), count(x) as stddev(x) supports 3,
> > avg(x) only supports 2, so will have to be eliminated.
>
> Are you assuming that "x" must match exactly among the aggregates
> to allow coverage?

In my haste I neglected to mention that critical part :) Yeah the
expression within the aggregation function must match exactly.
This would mean that count(*) would not optimise but count(x) would. I
believe that's an ok restriction on a first cut implementation, as that
rewrite requires some more NULLability analysis.

Have you given any thought to whether ties are
> possible and how they should be resolved?
>
>
I guess ties are possible, although I can't immediately think of which of
the standard aggs could cause that. I've not thought on it a great deal to
be honest. I'm no too sure if pg_proc.procost is better to check or if
pg_aggregate.aggtransspace would be better. I see procost is not very well
set for quite a lot of functions, e.g float4pl and numeric_sum currently
both have the cost of 1, which is certainly not the case. So perhaps it
would be easier to just use aggtransspace, and if there's still a tie then
just prefer the one that comes first in the targetlist. Likely that would
be good as it keeps it predictable and allows the users to have influence
on the decision.

> > I'm a little bit concerned that someone will one day report that:
> >
> > SELECT avg(x), sum(x), count(x) from bigtable;
> >
> > Is faster than:
> >
> > SELECT sum(x), count(x) from bigtable;
> >
> > Of course, this will be just because we've made case 1 faster,
> > NOT because we've slowed down case 2.
>
> Yeah, that seems possible (if not with these functions, at least
> technically possible with *some* functions), and hard to explain to
> a novice.
>
> > I can't immediately think of a way to fix that without risking
> > slowing down: select count(x) from bigtable;
>
> From the information you have proposed storing, with cost factors
> associated with the functions, it seems technically possible to
> infer that you could run (for example) the avg() aggregate to
> accumulate both but only run the final functions of the aggregates
> referenced by the query. That seems like an optimization to try
> hard to forget about until you have at least one real-world use
> case where it would yield a significant benefit. It seems
> premature to optimize for that before having the rest working.
>

I see your point, and in my example I think your idea works well, but
there's a risk of slowing things down here too for other cases.

With the "supporting aggregates" we're just listing the aggregates we
happen to calculate as a byproduct. Using their value is about as cheap as
calling the final function, but with this, if we decided to use avg(x)
instead of sum(x) and count(x) we really have no way to understand what
else avg(x) might be doing. For example, if we pretend avg() does not
exist, and we have stddev(x), sum(x) and count(x) as aggregates, given the
query "select sum(x), count(x) from t"... stddev(x) gives us count(x) and
sum(x) as a byproduct, but stddev also calculates sum(x*x), which quite
likely ends up slower than just doing sum(x) and count(x) like normal.
Perhaps the code that looks for these patterns could take the aggregate
that supports the smallest number of supporting aggregates that's enough to
cover the requirements, but still...

We could perhaps add some smarts that analyses the costs of the transition
function calls here which goes to ensure that stuff won't actually happen.
Maybe that's phase 3 though. I don't think I want to touch it at this
stage, but for sure something to remember for the future!

--
David Rowley http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2015-06-10 02:51:05 Re: The Future of Aggregation
Previous Message Tom Lane 2015-06-10 02:28:35 Re: [patch] PL/Python is too lossy with floats