Re: The Future of Aggregation

From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, "amit(dot)kapila(at)enterprisedb(dot)com" <amit(dot)kapila(at)enterprisedb(dot)com>, Simon Riggs <simon(dot)riggs(at)2ndquadrant(dot)com>
Subject: Re: The Future of Aggregation
Date: 2015-06-10 13:39:28
Message-ID: 1986970674.211372.1433943568062.JavaMail.yahoo@mail.yahoo.com
Lists: pgsql-hackers

David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
> On 10 June 2015 at 02:52, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:
>> David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
>>> The idea I discussed in the link in item 5 above gets around this
>>> problem, but it's a perhaps more surprise filled implementation
>>> as it will mean "select avg(x),sum(x),count(x) from t" is
>>> actually faster than "select sum(x),count(x) from t" as the agg
>>> state for avg() will satisfy sum and count too.
>>
>> I'm skeptical that it will be noticeably faster. It's easy to see
>> why this optimization will make a query *with all three* faster,
>> but I would expect the process of accumulating the sum and
>> count to be about the same speed whether performed by one
>> transition function or two. Of course I could be persuaded by a
>> benchmark showing otherwise.

Of course, after reading Tom's post and digging into which
aggregates share a transition function, I was already prepared to
eat my words above. Since the sum() aggregate is using the
xxx_avg_accum transition function, it is clearly doing the work of
accumulating the count already, so it's clear that the above can
be a win.
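
The shape of that win can be sketched in Python: one shared
transition function accumulates (sum, count) once per row, and each
aggregate then applies only its own cheap final function. The names
below are illustrative, not PostgreSQL internals (the real work
happens in C, in numeric.c's accum routines):

```python
# Sketch of aggregate-state sharing: a single transition function
# feeds sum(), count(), and avg(). Hypothetical names, not PG code.

def avg_accum(state, value):
    """Shared transition function: skip NULLs, accumulate sum and count."""
    if value is None:
        return state
    s, n = state
    return (s + value, n + 1)

# The final functions differ per aggregate but read the same state.
def sum_final(state):
    s, n = state
    return s if n > 0 else None   # SQL sum() is NULL over zero input rows

def count_final(state):
    return state[1]

def avg_final(state):
    s, n = state
    return s / n if n > 0 else None

rows = [1, 2, None, 3, 4]
state = (0, 0)
for v in rows:                    # one pass over the data...
    state = avg_accum(state, v)
# ...then one final-function call per aggregate, per group:
print(sum_final(state), count_final(state), avg_final(state))  # 10 4 2.5
```

So "select avg(x), sum(x), count(x)" costs one accumulation pass plus
three trivial final calls, not three accumulation passes.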

> Assuming that if we reuse the avg(x) state for count(x) and
> sum(x) then it will perform almost exactly like a query
> containing just avg(x), the only additional overhead is the call
> to the final functions per group, so in the following case that's
> likely immeasurable:
>
> /* setup */ create table millionrowtable as select
> generate_series(1,1000000)::numeric as x;
> /* test 1 */ SELECT sum(x) / count(x) from millionrowtable;
> /* test 2 */ SELECT avg(x) from millionrowtable;
>
> Test 1:
> 274.979 ms
> 272.104 ms
> 269.915 ms
>
> Test 2:
> 229.619 ms
> 220.703 ms
> 234.743 ms
>

> (About 19% slower)

Of course, with Tom's approach you would see the benefit; the two
statements should run at about the same speed.

I am a little curious what sort of machine you're running on,
because my i7 is much slower. I ran a few other tests with your
table for perspective.

To get the raw time to just pass the tuples:

SELECT from millionrowtable where xmin = '0';
Time: 125.340 ms
Time: 124.443 ms
Time: 115.629 ms

Just the count(*) of those rows didn't boost the time much:

SELECT count(*) from millionrowtable;
Time: 132.128 ms
Time: 128.036 ms
Time: 125.400 ms

The NULL check added by specifying count(x) boosted it more:

SELECT count(x) from millionrowtable;
Time: 165.858 ms
Time: 163.872 ms
Time: 165.448 ms

A NULL check plus numeric addition gets expensive:

SELECT sum(x) from millionrowtable;
Time: 366.879 ms
Time: 364.503 ms
Time: 365.418 ms

Since sum() and avg() use the same transition function, I was
surprised to see a difference here:

SELECT avg(x) from millionrowtable;
Time: 374.339 ms
Time: 372.294 ms
Time: 366.933 ms

Here's the statement you are talking about optimizing:

SELECT sum(x), count(x) from millionrowtable;
Time: 441.331 ms
Time: 442.501 ms
Time: 436.930 ms

To confirm that projecting the extra column compared to avg() was
not significant:

SELECT sum(x) / count(x) from millionrowtable;
Time: 442.404 ms
Time: 436.241 ms
Time: 442.381 ms

So this can reasonably be compared to the avg(x) time above.

On my machine this optimization could be expected to shave off
about 16% of current run time.

One question that arose in my mind running this was whether we
might be able to combine sum(x) with count(*) if x were NOT NULL,
even though the arguments don't match. It might not be worth the
gymnastics of recognizing the special case, and I certainly
wouldn't recommend looking at that optimization in a first pass;
but it might be worth jotting down on a list somewhere....
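
The equivalence behind that idea is just that when no value of x is
NULL, count(x) (non-NULL values) and count(*) (rows) agree, so the
two aggregates could in principle share one state. A toy check,
illustrative only:

```python
# With a NOT NULL column, count(x) == count(*), so a planner could
# fold them into a shared aggregate state. Toy demo, not PG code.
rows = [5, 7, 9]                                  # column declared NOT NULL
count_star = len(rows)                            # count(*): all rows
count_x = sum(1 for v in rows if v is not None)   # count(x): non-NULL values
assert count_star == count_x
print(count_star, count_x)  # 3 3
```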

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
