Re: Slow count(*) again...

From: Neil Whelchel <neil(dot)whelchel(at)gmail(dot)com>
To: "Pierre C" <lists(at)peufeu(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Slow count(*) again...
Date: 2010-10-12 23:19:33
Message-ID: 201010121619.34839.neil.whelchel@gmail.com
Lists: pgsql-hackers pgsql-performance

On Tuesday 12 October 2010 14:35:01 you wrote:
> > suggest that 99% of instances of the "select count(*)" idiom are
> > probably bad use of the SQL language.
>
> Well, suppose you paginate results. If the user sees that the search query
> returns 500 pages, there are two options:
>
> - you're google, and your sorting algorithms are so good that the answer
> the user wants is in the first page
> - or the user will refine his search by entering more keywords to get a
> manageable result set
>
> So, in both cases, the count(*) was useless anyway. And the slowest ones
> are the most useless, since the user will immediately discard the result
> and refine his query.
>
> If your full text search is slow, try Xapian or Lucene.

I guess I have to comment here again and point out that while I am having this
issue with text searches, I avoid using count(*) in such cases; I just use
next and previous links. The real problem (for me) is when someone searches a
date or time range. My application keeps track of huge amounts of realtime
transactional data, so an administrator might want a report of what some data
point did yesterday between 3 and 4 PM. Under normal conditions the range of
records that match can be between 0 and over 5,000. This is really killing me,
especially when the reporting people want a list of how many transactions each
point in a given zipcode had this morning between 8 and 9 AM; that takes about
5 minutes to run on a server that has enough RAM to hold the entire table!

Pseudo query:
Show how many transactions per node in zipcode 92252 between 8:00 and 9:00
today:

 point_number | number_of_transactions
--------------+------------------------
        65889 |                     31
        34814 |                   4865
        28349 |                      0
         3358 |                    364
 ...

24 total rows, > 5 minutes.
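
In actual SQL it would look something like this (the points and transactions
tables, their columns, and the timestamps are made up for illustration; my
real schema differs):

-- Hypothetical schema: points(point_number, zipcode),
-- transactions(point_number, xtime, ...).
-- The LEFT JOIN keeps points with zero transactions in the result.
SELECT p.point_number,
       count(t.point_number) AS number_of_transactions
  FROM points p
  LEFT JOIN transactions t
         ON t.point_number = p.point_number
        AND t.xtime >= '2010-10-12 08:00'
        AND t.xtime <  '2010-10-12 09:00'
 WHERE p.zipcode = '92252'
 GROUP BY p.point_number;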

Then they want every node to be a link to a list of the actual data within the
specified timeframe.
This is where I have to run the same query twice: first to find out how many
rows there are for the page links, then again to get a page of results.
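Roughly this, using the same made-up schema (page size and offset are
arbitrary):

-- Pass 1: count the matching rows just to build the page links.
SELECT count(*)
  FROM transactions
 WHERE point_number = 65889
   AND xtime >= '2010-10-12 08:00'
   AND xtime <  '2010-10-12 09:00';

-- Pass 2: fetch one page of the same rows (here page 3 at 50 per page).
SELECT *
  FROM transactions
 WHERE point_number = 65889
   AND xtime >= '2010-10-12 08:00'
   AND xtime <  '2010-10-12 09:00'
 ORDER BY xtime
 LIMIT 50 OFFSET 100;
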
Sure, I could keep tables around that have numbers by the hour, minute, day, or
whatever to cache results and speed things up, but then the problem is that
when the data is put into the server, there are so many statistics tables to
update that the front end becomes a huge problem. It also makes for a huge mess
of tables to think about when I need to make a report.
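
To be concrete, each such cache would mean something like this (names again
made up), and one of these per granularity, per report, is exactly the
insert-time overhead I mean:

-- Hypothetical hourly rollup, maintained on every insert.
CREATE TABLE transaction_counts_hourly (
    point_number integer     NOT NULL,
    hour_start   timestamptz NOT NULL,
    n            bigint      NOT NULL DEFAULT 0,
    PRIMARY KEY (point_number, hour_start)
);

-- Not concurrency-safe as written; a real version would need locking.
CREATE FUNCTION bump_hourly_count() RETURNS trigger AS $$
BEGIN
    UPDATE transaction_counts_hourly
       SET n = n + 1
     WHERE point_number = NEW.point_number
       AND hour_start   = date_trunc('hour', NEW.xtime);
    IF NOT FOUND THEN
        INSERT INTO transaction_counts_hourly (point_number, hour_start, n)
        VALUES (NEW.point_number, date_trunc('hour', NEW.xtime), 1);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER transactions_bump_hourly
AFTER INSERT ON transactions
FOR EACH ROW EXECUTE PROCEDURE bump_hourly_count();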

-Neil-
