Aggregate Supporting Functions

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Aggregate Supporting Functions
Date: 2015-06-09 03:55:14
Message-ID: CAKJS1f8ebkc=EhEq+ArM8vwYZ5vSapJ1Seub5=FvRRuDCtFfsQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I believe this is an idea that's been discussed before, but I'm not exactly
sure where that happened:

Overview:

The idea is that we skip a major chunk of processing in situations like:

SELECT avg(x),sum(x),count(x) FROM bigtable;

Because avg(x) already technically knows what the values of sum(x) and
count(x) are.

The performance improvement of this particular case is as follows:

create table bigtable as select
generate_series(1,1000000)::numeric as x; vacuum bigtable;

SELECT avg(x),sum(x),count(x) FROM bigtable; -- Query 1

Time: 390.325 ms
Time: 392.297 ms
Time: 400.790 ms

SELECT avg(x) FROM bigtable; -- Query 2

Time: 219.700 ms
Time: 215.285 ms
Time: 233.691 ms

With the implementation I'm proposing below, I believe that query 1 should
perform almost as well as query 2. The only extra CPU work that would be
done would be some extra checks during planning, and the calling of 2
simple new final functions which will extract the count(x) and sum(x) from
the avg transition's state.

Purpose of this Email:

For technical review of proposed implementation.

Implementation:

1. Add a new boolean column pg_aggregate named hassuppagg which will be set
to true if the aggregate supports other aggregates. For example avg(int)
will support count(int) and sum(int)
2. Add new system table named pg_aggregate_support (Or some better shorter
name)

This system table will be defined as follows:
aspfnoid regproc,
aspfnsupported regproc,
aspfinalfn regproc,
primary key (aspfnoid, aspfnsupported)

Where in the above example aspfnoid will be avg(int) and 2 rows will exist.
1 with count(int) in aspfnsupported, and one with sum(int) in the
aspfnsupported column. aspfinalfn will be a new final function which
extracts the required portion of the avg's aggregate state.

3. Add logic in the planner to look for look for supporting cases. With
logic something along the lines of:

a. Does the query have any aggregates? If not -> return;
b. Does the query have more than 1 aggregate? If not -> return;
c. Does the at least one of the aggregates have hassuppagg set to true?
if not -> return;
d. Analyze aggregates to eliminate aggregates that are covered by another
aggregate. We should use the aggregate which eliminates the most other
aggregates*

* For example stddev(x) will support avg(x), sum(x) and count(x) so a query
such as select stddev(x), avg(x), sum(x), count(x) will eliminate avg(x),
sum(x), count(x) as stddev(x) supports 3, avg(x) only supports 2, so will
have to be eliminated.

Concerns:

I'm a little bit concerned that someone will one day report that:

SELECT avg(x), sum(x), count(x) from bigtable;

Is faster than:

SELECT sum(x), count(x) from bigtable;

Of course, this will be just because we've made case 1 faster, NOT because
we've slowed down case 2.
I can't immediately think of a way to fix that without risking slowing
down: select count(x) from bigtable;

CREATE AGGREGATE Syntax:

To allow users to implement aggregates which take advantage of this we'd
better also expand the CREATE AGGREGATE syntax.

I've not given this a huge amount of thought. The only thing I've come up
with so far is;

CREATE AGGREGATE avg(bigint)
(FINALFUNC = avgfinal)
SUPPORTS (count(bigint) = int8_avg_countfn, sum(bigint) = int8_avg_sumfn);

Can anyone think of anything that I've not accounted for before I go off
and work on this?

Regards

David Rowley

"The research leading to these results has received funding from the
European Union’s
Seventh Framework Programme (FP7/2007-2015) under grant agreement n° 318633"

--
David Rowley http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Training & Services

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2015-06-09 04:00:15 Information of pg_stat_ssl visible to all users
Previous Message Fujii Masao 2015-06-09 03:47:49 Re: Memory leak with XLogFileCopy since de768844 (WAL file with .partial)