Re: Performance Issues with count()

From: Grant Johnson <grant(at)amadensor(dot)com>
To: S Grannis <sjg(at)email(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Performance Issues with count()
Date: 2002-04-25 19:19:12
Message-ID: 3CC856B0.8060402@amadensor.com
Lists: pgsql-admin pgsql-general

Index the yb column. It might help the sort and group steps run faster:

CREATE INDEX data_table_yb ON data_table (yb);
VACUUM ANALYZE;

EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;

SELECT yb, count(1) FROM data_table GROUP BY yb;

Also try count(*) instead. Unlike count(column), which skips rows where that column is NULL, count(*) simply counts every row.
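
For example, the same query with count(*) (the plan should be identical, so any timing difference is noise):

EXPLAIN SELECT yb, count(*) FROM data_table GROUP BY yb;
SELECT yb, count(*) FROM data_table GROUP BY yb;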

S Grannis wrote:

>Thanks for all of the useful comments.
>
>Per recommendation, I set enable_seqscan=0 and re-ran the query. The time to count() yb went from 2 hours 26 minutes to 2 hours 17 minutes. That variation of 9 minutes is likely related to the load on the machine at the time.
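>
>For anyone reproducing this, the per-session toggle looks like the
>following (a sketch; the exact commands weren't shown):
>
>SET enable_seqscan = 0;
>EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;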
>
>Others have suggested the "fix" is in the future.
>
>Stephan Szabo wrote:
>"I didn't see anything about your settings in postgresql.conf,
>but increasing the sort_mem parameter may help that really
>expensive sort step. I think the desired fix for this would
>probably be the TODO entry on hash based aggregates but that's
>still in the future..."
>
>The "non-default" postgresql.conf settings are as follows:
>
>shared_buffers = 240000 # uses ~2GB of shared mem
>sort_mem = 512
>wal_files = 64
>enable_seqscan = 0 # per a recommendation
>enable_indexscan = true
>enable_tidscan = true
>enable_sort = true
>enable_nestloop = true
>enable_mergejoin = true
>enable_hashjoin = true
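>
>Note that sort_mem is measured in kilobytes, so the 512 above is only
>half a megabyte for a sort of 65 million rows; that sort is bound to
>spill to disk. A per-session increase is a cheap experiment (the figure
>below is illustrative, not a tuned value):
>
>SET sort_mem = 65536; -- 64MB, for this session only
>SELECT yb, count(1) FROM data_table GROUP BY yb;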
>
>I think our work-around for now will be to SELECT the column we wish to analyze into a flat file and then run a Perl script to do the actual counting.
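>
>One way to do that export from psql, as a sketch (unaligned,
>tuples-only output; the file name is illustrative):
>
>\a
>\t
>\o yb_values.txt
>SELECT yb FROM data_table;
>\o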
>
>Thanks again for the feedback,
>
>Shaun Grannis
>
>----- Original Message -----
>From: Michael Loftis <mloftis(at)wgops(dot)com>
>Date: Wed, 24 Apr 2002 10:14:04 -0700
>To: asdf asdasfa <sjg(at)email(dot)com>
>Subject: Re: [GENERAL] Performance Issues with count()
>
>
>>Can you humour me and
>>set enable_seqscan=0
>>and retry the query?
>>
>>Thanks :)
>>
>>S Grannis wrote:
>>
>>>Hi,
>>>
>>>I've found some performance issues with Postgres that
>>>I'm hoping people on this list can help resolve. We're
>>>working with a 65 million record table that includes year
>>>of birth (data type INT). To count the frequency of dates
>>>in the table, it takes 2 hours 26 minutes to execute.
>>>(There's an approximately 100-year range of dates in the
>>>65 million records).
>>>
>>># EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;
>>>
>>>NOTICE: QUERY PLAN:
>>>Aggregate (cost=16397434.27..16723548.69 rows=6522288 width=4)
>>>  ->  Group (cost=16397434.27..16560491.48 rows=65222884 width=4)
>>>        ->  Sort (cost=16397434.27..16397434.27 rows=65222884 width=4)
>>>              ->  Seq Scan on data_table (cost=0.00..2368620.84 rows=65222884 width=4)
>>>
>>>I can count data from the flat text data file with this
>>>Perl script:
>>>
>>>#!/usr/bin/perl
>>># script to count YB frequencies in a flat data file
>>>open(IN, "$ARGV[0]");          # input: one year-of-birth value per line
>>>open(OUT, ">$ARGV[0]\_cnt");   # output: "value,count" lines
>>>while (<IN>) {
>>>    chomp;
>>>    $years{$_}++;              # tally each distinct value
>>>}
>>>foreach $key (keys %years) {
>>>    print OUT "$key,$years{$key}\n";
>>>}
>>>
>>>The Perl script takes *1 minute* 31 seconds to run.
>>>Why is there such a discrepancy in times? I've noticed
>>>that the Postgres count() function takes what seems to
>>>be "longer than it should" in other cases as well. For
>>>instance, counting the frequency of last names in the
>>>same 65 million record table took *1 hour* and 31
>>>minutes:
>>>
>>># EXPLAIN SELECT ln, count(1) FROM data_table GROUP BY ln;
>>>NOTICE: QUERY PLAN:
>>>Aggregate (cost=19538149.27..19864263.69 rows=6522288 width=19)
>>>  ->  Group (cost=19538149.27..19701206.48 rows=65222884 width=19)
>>>        ->  Sort (cost=19538149.27..19538149.27 rows=65222884 width=19)
>>>              ->  Seq Scan on data_table (cost=0.00..2324610.84 rows=65222884 width=19)
>>>
>>>The last name (ln) and year of birth (yb) columns are
>>>indexed, but that shouldn't matter because it's doing a
>>>sequential scan, correct? Am I running into the
>>>limitations of Postgres? We'd like to eventually get this
>>>system into production, but if we can't get Postgres to
>>>count() faster, we may not be able to use it.
>>>
>>>Here's the data_table schema:
>>>
>>># \d data_table
>>>
>>>Table "data_table"
>>> Column | Type | Modifiers
>>>--------+---------------+-----------
>>> ss | character(9) |
>>> ln | character(15) |
>>> fn | character(15) |
>>> mi | character(1) |
>>> ns | character(15) |
>>> lny | character(15) |
>>> fny | character(15) |
>>> sny | character(15) |
>>> g | character(1) |
>>> mb | integer |
>>> db | integer |
>>> yb | integer |
>>> md | integer |
>>> dd | integer |
>>> yd | integer |
>>>Indexes: ssdi_ss_idx
>>> ssdi_ln_idx
>>>
>>>We're working with Postgres v7.2. The machine is a
>>>dual-processor Athlon MP1900 (Tyan Tiger board) with
>>>3GB of PC2100 DDR RAM and three 80GB IBM 120GXP hard
>>>drives configured in a software RAID 0 array, running
>>>under RedHat Linux v7.2.
>>>
>>>We've VACUUM ANALYZE'd the tables after creating the
>>>indices. Is there something I'm missing here?
>>>
>>>Thanks for your suggestions.
>>>
>>>Shaun Grannis
>>>
