Re: Aggregate query for multiple records

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Scott Gerhardt <scott(at)g-it(dot)ca>, pgsql-sql(at)postgresql(dot)org
Subject: Re: Aggregate query for multiple records
Date: 2004-08-28 08:35:58
Message-ID: 87eklrlkwh.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql


Scott Gerhardt <scott(at)g-it(dot)ca> writes:

> Hello, I am new to the list, my apology if this question is beyond the scope or
> charter of this list.
>
> My questions is:
> What is the best method to perform an aggregate query to calculate sum() values
> for each distinct wid as in the example below, but except for all wid's (not
> just WHERE wid='01/1-6-1-30w1/0').
>
> Also, performance wise, would it be better to build a function for this query.
> The table has 9 million records and these aggregate queries take hours.

The "top n" type query (or in this case "first n" or "last n" but it's the
same thing) is actually very tricky to do in standard SQL. The best solution
seen here for postgres is to use arrays and custom aggregate functions.

The following is based on a previous answer from Tom Lane to a similar
question. (I thought I already posted this for you on pgsql-general but the
list archives are down and you don't seem to have seen it, so I'm resending
it)

It allows you to do the whole query with a single sort for the grouping and
the ordering by date together. You would have to use it with something like:

SELECT sum_first_6(oil)
FROM (SELECT oil from prd_data ORDER BY wid, "date")
GROUP BY wid

If you pump up sort_mem enough -- you can do it within the session for the
single connection using "SET sort_mem" it should be pretty fast.

I think it's the best you're going to get. If you're absolutely sure the data
is physically stored in chronological order -- which I would only feel
comfortable with if you've never done any updates or deletes, only inserts and
perhaps occasional truncates, then you might be able to get by without
ordering and convince it to do a hash aggregate. That would be the optimal
result, no sorts at all. But it would be hard to make sure it would always
work.

test=> create or replace function first_6_accum (integer[], integer) returns integer[] language sql immutable as 'select case when array_upper($1,1)>=6 then $1 else $1||$2 end';
CREATE FUNCTION

test=> create function sum_6(integer[]) returns integer immutable language sql as 'select $1[1]+$1[2]+$1[3]+$1[4]+$1[5]+$1[6]';
CREATE FUNCTION

test=> create aggregate sum_first_6 (basetype=integer, sfunc=first_6_accum, stype=integer[],initcond='{}',finalfunc=sum_6);
CREATE AGGREGATE

test=> select sum_first_6(i) from (select i from (select 1 as i union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8) as x order by i desc) as x;
sum_first_6
-------------
33
(1 row)

test=> select sum_first_6(i) from (select i from (select 1 as i union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8) as x order by i asc) as x;
sum_first_6
-------------
21
(1 row)

This can easily be switched around to make it "last_6" and you can write
functions to handle 6 records or 9 records. And all of these could be combined
in a single query, so you only have to do the sort once.

Unfortunately you cannot make aggregate functions that take multiple
parameters, nor can you pass extra parameters to the state function. So you'll
have to create a separate set of functions for each variant.

Also, you'll have to change it to use reals.

--
greg

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Gaetano Mendola 2004-08-28 12:48:40 Re: backup of a specific schema
Previous Message Tom Lane 2004-08-28 03:05:41 Re: [SQL] array_in: '{}}'::text[]