aggregation problem: first/last/count(*)

From: "Marc Mamin" <M(dot)Mamin(at)intershop(dot)de>
To: <pgsql-sql(at)postgresql(dot)org>
Subject: aggregation problem: first/last/count(*)
Date: 2009-01-26 14:48:47
Message-ID: C4DAC901169B624F933534A26ED7DF311D537C@JENMAIL01.ad.intershop.net
Lists: pgsql-sql

Hello,

I have a query to aggregate data which is too slow :-)

Here is a simplified example:

create table test
(
  time     int8,     -- store the time as epoch
  a_group  varchar,
  category varchar
);

For each group, I need the first/last times and categories, the number
of distinct categories, and the number of records.

Here is my best solution so far:

SELECT
FIRST.a_group,
FIRST.time as first_time,
FIRST.category as first_category,
LAST.time as last_time,
LAST.category as last_category,
AGG.c_count,
AGG.c_all
FROM

(
select distinct on (a_group)
a_group,
time,
category
from test
order by a_group, time
) FIRST,

(
select distinct on (a_group)
a_group,
time,
category
from test
order by a_group, time DESC
) LAST,

(
select a_group,
count(distinct category) as c_count,
count(*) as c_all
from test
group by a_group
order by a_group
) AGG

where FIRST.a_group = LAST.a_group
  and LAST.a_group = AGG.a_group;

Each sub-query is quite fast (thanks for the DISTINCT ON feature :-)),
but the whole query is really slow, as Postgres starts to swap due to
the large amount of data to merge.

I guess there must be a better solution, as the three sub-queries each
return exactly one row per 'a_group' and are moreover already sorted
(the table does not contain any NULL values). But in the query plan
below, most of the cost comes from the merges.

I imagine there must be a way using custom aggregate functions, but I'm
not confident with those: is it possible to define an aggregate that
retrieves the first/last values of an ordered result set? This would
allow a single scan of the table, with something like:

select a_group,
       first(category) as first_category,
       last(category) as last_category,
       ...
from (select * from test order by a_group, time) t
group by a_group;
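Reading the docs on CREATE AGGREGATE, I imagine something along these
lines could work (untested sketch; the helper names first_agg/last_agg
are just made up here):

create or replace function first_agg(anyelement, anyelement)
returns anyelement language sql immutable strict as
$$ select $1 $$;

create or replace function last_agg(anyelement, anyelement)
returns anyelement language sql immutable strict as
$$ select $2 $$;

-- state keeps the first value seen in the group
create aggregate first (anyelement)
(
  sfunc = first_agg,
  stype = anyelement
);

-- state is replaced by each new value, so the last one wins
create aggregate last (anyelement)
(
  sfunc = last_agg,
  stype = anyelement
);

With those in place, the whole report would hopefully collapse into a
single scan:

select a_group,
       first(time)     as first_time,
       first(category) as first_category,
       last(time)      as last_time,
       last(category)  as last_category,
       count(distinct category) as c_count,
       count(*)        as c_all
from (select * from test order by a_group, time) t
group by a_group;

(I am not sure Postgres formally guarantees that the aggregates see the
rows in the sub-select's ORDER BY order, but with a GroupAggregate over
sorted input it seems to hold in practice.)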

Many thanks for any hints.

Marc Mamin

Here are some dummy values if you'd like to play with this issue:

insert into test select s,    'G'||s,    'C1' from (select generate_series(1,10000)    as s) s;
insert into test select s+10, 'G'||s,    'C2' from (select generate_series(1,10000)    as s) s;
insert into test select s+13, 'G'||s,    'C3' from (select generate_series(1,10000)    as s) s;
insert into test select s+1,  'G'||s,    'C2' from (select generate_series(1,10000,5)  as s) s;
insert into test select s,    'G'||s%10, 'C3' from (select generate_series(1,10000,5)  as s) s;
insert into test select s+1,  'G'||s%5,  'C2' from (select generate_series(1,10000,5)  as s) s;
insert into test select s+1,  'G'||s,    'C1' from (select generate_series(1,1000000)  as s) s; -- 10^6 !!

create index test_i on test(a_group);
analyze test;

This gives the following plan:

Merge Join  (cost=259000.31..34904377039.75 rows=1550421099181 width=128)
  Merge Cond: ((test.a_group)::text = (last.a_group)::text)
  ->  Merge Join  (cost=129500.16..17814340.14 rows=783387153 width=120)
        Merge Cond: ((test.a_group)::text = (test.a_group)::text)
        ->  GroupAggregate  (cost=0.00..53681.23 rows=395825 width=10)
              ->  Index Scan using test_i on test  (cost=0.00..39973.53 rows=1036043 width=10)
        ->  Materialize  (cost=129500.16..133458.41 rows=395825 width=72)
              ->  Unique  (cost=119965.87..125146.08 rows=395825 width=18)
                    ->  Sort  (cost=119965.87..122555.97 rows=1036043 width=18)
                          Sort Key: test.a_group, test."time"
                          ->  Seq Scan on test  (cost=0.00..16451.43 rows=1036043 width=18)
  ->  Materialize  (cost=129500.16..133458.41 rows=395825 width=72)
        ->  Subquery Scan last  (cost=119965.87..129104.33 rows=395825 width=72)
              ->  Unique  (cost=119965.87..125146.08 rows=395825 width=18)
                    ->  Sort  (cost=119965.87..122555.97 rows=1036043 width=18)
                          Sort Key: test.a_group, test."time"
                          ->  Seq Scan on test  (cost=0.00..16451.43 rows=1036043 width=18)
