Slow performance with Group By

From: Erik Norvelle <signups(at)norvelle(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Slow performance with Group By
Date: 2004-11-09 00:09:58
Message-ID: B14B76F4-31E3-11D9-8120-000A9583BF06@norvelle.org
Lists: pgsql-performance

Greetings all,

This question has probably been asked many times, but I was unable to
use the list archives to search, since the term "Group" matches
thousands of messages containing the names of user groups... so sorry
if I'm repeating!

Here's the problem: I have a table of 10,000,000 records called
"indethom", each record representing a word in the works of a
particular author. Each record contains, among other columns, a
CHAR(5) column called "codelemm" representing the "lemma" code (i.e.
which word it is), and an integer column called "sectref" representing
a textual unit, i.e. a chapter or other division of a work (these are
numbered consecutively from 0 to around 50,000). What I want to do is
find out how many times each word occurs in each textual unit (with no
row returned for textual units where a particular word doesn't
appear). I used a GROUP BY clause to group by "sectref", and then used
COUNT(codelemm) to sum up the occurrences. The codelemm column also
had to appear in the GROUP BY, in order to satisfy Postgres's
requirements. Here's the query as I have it:

> create table matrix2.tuo as select codelemm, sectref, count(codelemm)
>     from indethom group by codelemm, sectref;
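
For reference, the relevant part of the table definition looks roughly
like this (I've left out the other columns):

> create table indethom (
>     codelemm char(5),   -- lemma code: which word this is
>     sectref  integer    -- textual unit, numbered 0 to ~50,000
>     -- ... other columns omitted ...
> );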

And the explain results are as follows:

> it=> explain select codelemm, sectref, count(codelemm) from indethom
>     group by codelemm, sectref;
>                                  QUERY PLAN
> ------------------------------------------------------------------------
>  GroupAggregate  (cost=2339900.60..2444149.44 rows=1790528 width=13)
>    ->  Sort  (cost=2339900.60..2364843.73 rows=9977252 width=13)
>          Sort Key: codelemm, sectref
>          ->  Seq Scan on indethom  (cost=0.00..455264.52 rows=9977252 width=13)

I have an index defined as follows:

> create index indethom_clemm_sect_ndx on indethom
>     using btree(codelemm, sectref);

I also performed an ANALYZE after creating the index.
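
Specifically, I ran:

> analyze indethom;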

I have the gut feeling that there's got to be a better way than a
sequential scan over 10,000,000 records, but I'll be darned if I can
find any way to improve things here.
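
One thing I was thinking of trying, though I haven't confirmed it
helps, is giving the sort more memory for this session, on the theory
that the planner might then choose a hash aggregate instead of the
sort-plus-group (the 262144 below is just a guess at a value, in KB):

> set sort_mem = 262144;  -- 256 MB for this session only; value is a guess
> explain select codelemm, sectref, count(codelemm) from indethom
>     group by codelemm, sectref;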

Thanks for any help you all can offer!!

Erik Norvelle
