Re: Performance Issues with count()

From: "S Grannis" <sjg(at)email(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Performance Issues with count()
Date: 2002-04-25 18:54:40
Message-ID: 20020425185441.10228.qmail@email.com
Lists: pgsql-admin pgsql-general

Thanks for all of the useful comments.

Per the recommendation, I set enable_seqscan=0 and re-ran the query. The time to count() yb went from 2 hours 26 minutes to 2 hours 17 minutes; the 9-minute difference is most likely just variation in machine load at the time.
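
For reference, that test was just a session-level planner toggle before re-running the same query (a sketch; the query is the one from my original message quoted below):

SET enable_seqscan = 0;
EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;
SELECT yb, count(1) FROM data_table GROUP BY yb;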

Others have suggested that the real "fix" is still in the future.

Stephan Szabo wrote:
"I didn't see anything about your settings in postgresql.conf,
but increasing the sort_mem parameter may help that really
expensive sort step. I think the desired fix for this would
probably be the TODO entry on hash based aggregates but that's
still in the future..."

The "non-default" postgresql.conf settings are as follows:

shared_buffers = 240000 # uses ~2GB of shared mem
sort_mem = 512
wal_files = 64
enable_seqscan = 0 # per a recommendation
enable_indexscan = true
enable_tidscan = true
enable_sort = true
enable_nestloop = true
enable_mergejoin = true
enable_hashjoin = true
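
As an aside, sort_mem is measured in kilobytes, so the 512 above is only half a megabyte of working memory for that big sort step. Stephan's suggestion can be tried per session before running the query (a sketch; 65536 is just an arbitrary example value):

SET sort_mem = 65536;
SELECT yb, count(1) FROM data_table GROUP BY yb;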

I think our work-around for now will be to SELECT the column we wish to analyze into a flat file and then run a Perl script to do the actual counting.
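
The export step can be done straight from psql (a sketch; \t suppresses headers and row-count footers, \a switches to unaligned output, and \o redirects query output to a file, here an example path):

\t
\a
\o /tmp/yb.txt
SELECT yb FROM data_table;
\o

The counting script from my original message (quoted below) can then be pointed at /tmp/yb.txt.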

Thanks again for the feedback,

Shaun Grannis

----- Original Message -----
From: Michael Loftis <mloftis(at)wgops(dot)com>
Date: Wed, 24 Apr 2002 10:14:04 -0700
To: asdf asdasfa <sjg(at)email(dot)com>
Subject: Re: [GENERAL] Performance Issues with count()

> Can you humour me and
> set enable_seqscan=0
> And retry the query?
>
> Thanks :)
>
> S Grannis wrote:
>
> > Hi,
> >
> > I've found some performance issues with Postgres that
> > I'm hoping people on this list can help resolve. We're
> > working with a 65 million record table that includes year
> > of birth (data type INT). To count the frequency of dates
> > in the table, it takes 2 hours 26 minutes to execute.
> > (There's an approximately 100-year range of dates in the
> > 65 million records).
> >
> > # EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;
> >
> > NOTICE: QUERY PLAN:
> > Aggregate (cost=16397434.27..16723548.69 rows=6522288 width=4)
> >   ->  Group (cost=16397434.27..16560491.48 rows=65222884 width=4)
> >         ->  Sort (cost=16397434.27..16397434.27 rows=65222884 width=4)
> >               ->  Seq Scan on data_table (cost=0.00..2368620.84 rows=65222884 width=4)
> >
> > I can count data from the flat text data file with this
> > Perl script:
> >
> > #!/usr/bin/perl
> > # script to count YB frequencies in flat data file
> > open (IN, "$ARGV[0]") or die "can't read $ARGV[0]: $!";
> > open (OUT, ">$ARGV[0]_cnt") or die "can't write $ARGV[0]_cnt: $!";
> > while (<IN>) {
> >     chomp;
> >     $years{$_}++;    # tally each distinct value
> > }
> > foreach $key (keys %years) {
> >     print OUT "$key,$years{$key}\n";
> > }
> >
> > The Perl script takes *1 minute*, 31 seconds to run.
> > Why is there such a discrepancy in times? I've noticed
> > that the Postgres count() function takes what seems to
> > be "longer than it should" in other cases as well. For
> > instance, counting the frequency of last names in the
> > same 65 million record table took *1 hour* and 31
> > minutes:
> >
> > # EXPLAIN SELECT ln, count(1) FROM data_table GROUP BY ln;
> > NOTICE: QUERY PLAN:
> > Aggregate (cost=19538149.27..19864263.69 rows=6522288 width=19)
> >   ->  Group (cost=19538149.27..19701206.48 rows=65222884 width=19)
> >         ->  Sort (cost=19538149.27..19538149.27 rows=65222884 width=19)
> >               ->  Seq Scan on data_table (cost=0.00..2324610.84 rows=65222884 width=19)
> >
> > The last name (ln) and the year of birth (yb) are
> > indexed, but that shouldn't matter because it's doing a
> > sequential scan, correct? Am I running into the
> > limitations of Postgres? We'd like to eventually get this
> > system into production, but if we can't get Postgres to
> > count() faster, we may not be able to use it.
> >
> > Here's the data_table schema:
> >
> > # \d data_table
> >
> > Table "data_table"
> >  Column |     Type      | Modifiers
> > --------+---------------+-----------
> >  ss     | character(9)  |
> >  ln     | character(15) |
> >  fn     | character(15) |
> >  mi     | character(1)  |
> >  ns     | character(15) |
> >  lny    | character(15) |
> >  fny    | character(15) |
> >  sny    | character(15) |
> >  g      | character(1)  |
> >  mb     | integer       |
> >  db     | integer       |
> >  yb     | integer       |
> >  md     | integer       |
> >  dd     | integer       |
> >  yd     | integer       |
> > Indexes: ssdi_ss_idx
> > ssdi_ln_idx
> >
> > We're working with Postgres v 7.2. The machine is a
> > dual-processor Athlon MP1900 (Tyan Tiger board) with
> > 3GB of PC2100 DDR RAM, and 3-80GB IBM 120GXP hard drives
> > configured in a software RAID 0 Array running under
> > RedHat Linux v. 7.2.
> >
> > We've VACUUM ANALYZE'd the tables after creating the
> > indices. Is there something I'm missing here?
> >
> > Thanks for your suggestions.
> >
> > Shaun Grannis