Re: Slow count(*) again...

From: Joe Uhl <joeuhl(at)gmail(dot)com>
To: Neil Whelchel <neil(dot)whelchel(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Slow count(*) again...
Date: 2010-10-12 14:19:57
Message-ID: 4CB46E8D.1010800@gmail.com
Lists: pgsql-hackers pgsql-performance

>> The biggest single problem with "select count(*)" is that it is
>> seriously overused. People use that idiom to establish existence, which
>> usually leads to a performance disaster in the application using it,
>> unless the table has no more than a few hundred records. The SQL
>> language, of which PostgreSQL offers an excellent implementation, has
>> offered the [NOT] EXISTS clause since its inception in the Jurassic
>> era. The problem is with the sequential scan, not with counting. I'd
>> even go as far as to suggest that 99% of instances of the "select
>> count(*)" idiom are probably bad uses of the SQL language.
>
> I agree; I have seen many very bad examples of using count(*). I will
> go so far as to question the use of count(*) in my examples here. Is
> there a better way to come up with a page list than using count(*)?
> What is the best method to make a page of results and a list of links
> to other pages of results? Am I barking up the wrong tree here?
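
On the EXISTS point: for a pure existence check, something like the
following avoids counting entirely. This is just a minimal sketch in
Python, assuming psycopg2 and a made-up orders table:

import psycopg2  # assumed driver; any DB-API adapter works the same way

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string

def customer_has_orders(customer_id):
    # EXISTS lets the planner stop at the first matching row, where
    # count(*) > 0 would keep scanning to the end of the table.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT EXISTS (SELECT 1 FROM orders WHERE customer_id = %s)",
            (customer_id,),
        )
        return cur.fetchone()[0]
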
As for the pagination question: one way I have dealt with this on very
large tables is to cache the count(*) at the application level (using
memcached, Terracotta, or something along those lines) and then
increment that cache whenever you add a row to the relevant table. On
application restart the cache is re-initialized with a regular old
count(*). This approach works really well, and in my experience all
large systems eventually need caching in front of the DB anyway. If you
have a simpler system with, say, a single application/web server, you
can simply store the value in a variable; the specifics depend on the
language and framework you are using.
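
To make that concrete, here is a rough sketch of the single-server
variant in Python. An in-process dict stands in for memcached here;
with memcached you would use its atomic increment the same way across
several app servers. The table name and helpers are hypothetical:

import threading

_counts = {}
_lock = threading.Lock()

def init_count(cur, table="page_hits"):
    # On application startup, seed the cache with one real count(*).
    cur.execute("SELECT count(*) FROM " + table)  # table name trusted here
    with _lock:
        _counts[table] = cur.fetchone()[0]

def note_insert(table="page_hits", n=1):
    # Call this wherever the application inserts rows into the table.
    with _lock:
        _counts[table] = _counts.get(table, 0) + n

def cached_count(table="page_hits"):
    # Pagination code reads this instead of running count(*).
    with _lock:
        return _counts.get(table, 0)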

Another, more all-DB approach is to create a statistics table into
which you place aggregated statistics rows (num deleted, num inserted,
totals, etc.) at an appropriate time interval in your code. You then
have rows containing aggregated statistics for the past, plus a tiny
portion of new data happening right now that hasn't yet been
aggregated. Queries become a summation of the aggregated values in the
statistics table plus a count(*) over just the newest portion of the
data table, and they are generally very fast.
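
A sketch of what such a query can look like, assuming a hypothetical
page_hits table with a monotonically increasing id column and a stats
table holding one row per already-aggregated interval (all names made
up for illustration):

# Assumed schema:
#   CREATE TABLE stats (upto_id bigint, num_rows bigint);
#   -- each row says: num_rows data rows exist with id <= upto_id
#   -- beyond the previous interval's upto_id

def fast_count(cur):
    # Sum what has already been aggregated, then count(*) only the
    # thin slice of rows newer than the last aggregation point.
    cur.execute(
        "SELECT coalesce(sum(num_rows), 0)::bigint,"
        "       coalesce(max(upto_id), 0) FROM stats"
    )
    aggregated, last_id = cur.fetchone()
    cur.execute("SELECT count(*) FROM page_hits WHERE id > %s", (last_id,))
    return aggregated + cur.fetchone()[0]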

Overall I have found that once things get big, the layers of your app
stack start to blend together and have to be combined in clever ways to
keep speed up. Postgres is a beast, but when you run into something it
can't do well, find a way to cache it or pair Postgres with some other
persistence technology that handles those cases.
