Re: slow group by query

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Ellen Cyran <ellen(at)urban(dot)csuohio(dot)edu>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: slow group by query
Date: 2002-11-19 16:54:08
Message-ID: 21604.1037724848@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Ellen Cyran <ellen(at)urban(dot)csuohio(dot)edu> writes:
> Here is the explain analyze:

> Group (cost=637.18..696.52 rows=593 width=22) (actual time=982.67..67581.85 rows=435 loops=1)
> -> Sort (cost=637.18..637.18 rows=5934 width=22) (actual time=833.27..844.78 rows=6571 loops=1)

Well, we don't have to read any further than that to see that all the
time is being spent in the final Group step --- and since grouping 6500
rows is surely not taking very long, the true cost must be in the
evaluation of the SELECT's output targetlist (which the estimated costs
ignore, since the planner can't do much to change it). In other words,
what's hurting you are those subselects in the SELECT list. You're
doing two thousand separate subselects (435*5) --- evidently they take
about 30 msec apiece, which isn't that bad by itself, but it adds up.

What you've basically got here is what Joe Celko calls a "crosstab
by subqueries" (_SQL For Smarties_, section 23.6.3). You might want
to buy his book, which shows several other ways to do crosstab queries,
but the one that seems most directly useful is to GROUP BY and use
aggregate functions to extract the values you want in each crosstab
column:

SELECT msa_code, sic, own, ind_div,
max(case when year = '1975' then emp else null end) as emp1975,
max(case when year = '1976' then emp else null end) as emp1976,
max(case when year = '1977' then emp else null end) as emp1977,
max(case when year = '1978' then emp else null end) as emp1978,
max(case when year = '1990' then emp else null end) as emp1990
FROM tbl_bls_msa
WHERE msa_code in ('1680','1640','0040','0120','0080')
GROUP BY msa_code, sic, ind_div, own ;

If I understand your data schema correctly, only one row in a
msa_code, sic, ind_div, own group will have a particular year
value, so the case/max structure extracts that value, or gives
NULL if there's no such row. (MIN would have worked as well;
with a numeric field you could use SUM too.)

regards, tom lane

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Giannis 2002-11-19 17:06:57 SQL -select count-
Previous Message Ellen Cyran 2002-11-19 16:26:15 Re: slow group by query