Re: count(*) and bad design was: Experiences with extensibility

From: Zoltan Boszormenyi <zb(at)cybertec(dot)at>
To: Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: count(*) and bad design was: Experiences with extensibility
Date: 2008-01-09 19:29:39
Message-ID: 478520A3.8010407@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Ivan Sergio Borgonovo írta:
> On Wed, 9 Jan 2008 13:04:39 +0100
> "Harald Armin Massa" <haraldarminmassa(at)gmail(dot)com> wrote:
>
>
>> Ivan,
>>
>
>
>>> Please forgive my naiveness in this field but what does it mean an
>>> "exact count" and what other DB means with "an exact count" and
>>> how other DB deal with it?
>>>
>
>
>> PostgreSQL will give you an exact count of the contents of the
>> database as it is in the moment you begin your count. (i.e. the
>> transaction starts)
>>
>
>
>> BUT as the table is not locked, in parallel somebody can bulkload
>> MANY items into the database, so at the moment (start of your
>> transaction)
>> + 1msec your count may be invalid allready.
>>
>
> That's reasonable. What other DB do and what is the SQL standard
> requirement for count(*)?
>
>
>>> I'd expect it perform as good or better than other DB since now
>>> the bottleneck should be how efficiently it can filter records...
>>> but still a count(*) with a where clause will incur in the same
>>> problem of what "exact" means.
>>>
>> I know of 3 concepts to answer count() faster then PostreSQL:
>>
>> 1) just lie, present an estimate
>> 2) do not have MVCC
>> 3) store "record deleted info" in index, so you can answer count()
>> with only scanning the index
>>
>
> Sorry if I insist but I think this is a beaten selling point against
> postgresql.
> One of the most recurring argument about why someone else db is
> better than pg is count is slow.
>
> Who lies?
> If it is possible to do otherwise to have count run faster what are
> the trade off that make it unreasonable to implement it in pg?
>

The decision to use MVCC in PostgreSQL makes the point moot.
What your transaction cannot see (e.g. another transaction inserted
a new record but didn't COMMITted yet) cannot be counted as visible.
It's not only a theoretical but also practical. Compare that to the
MaxDB way. At the very lowest level MaxDB keeps the records
in a balanced b-tree based on its primary key. Look at this URL:
http://blog.ulf-wendel.de/?p=76
MaxDB keeps how many records are in the tree at all times which is
handy if your database is close to idle. Consequences:

- very fast COUNT(*) when idle
- records unCOMMITted by other transactions can be seen by COUNT(*)
if your transaction is in "read committed" isolation level but not the
actual records
- COUNT(*) waits for all other transactions that modified the table
in question to finish if your transaction is in "repeatable read" or
"serializable" isolation levels. Consider that transactions can take
a loooong time to finish if they process many things. This way your
SELECT COUNT(*) doesn't respond instantly but doesn't slow down
your server either. But the end user perception is the same:
COUNT(*) is slow!

In PostgreSQL, COUNT(*) responds closely at the same speed regardless
of other transactions. Which way do you prefer?

> This is not very useful question but 3) imply that select scan the
> index return the rows and just later check if they are still there.
> Is it?
> And since indexes aren't updated "on the fly" you may get back a
> larger number than what is actually the real value.
>
> Let me consider an everyday use where count() looks as the most
> obvious solution: paging.
>
> I search trough a table and I need to know which is the last page.
> Concurrently someone is deleting a ton of records.
> No matter if count() is fast or not when I output the pager it will
> be "wrong". But still I'll need an estimate of the number of pages,
> it is not fun if that estimate is wrong *and* slow.
> And once you add the where clauses there is no way to cache the count.
> Is there a way to count based on indexes without taking into account
> deleted rows so to "count" faster?
>
> I can make the search faster using indexes as Dann Corbit suggested,
> but as you imply that won't make count as fast as the
> "concurrence"[1] that lie or don't use MVCC or store deleted info in
> indexes.
>
> SELECT reltuples FROM pg_class WHERE relname = <table_name>;
>
> doesn't apply since you can't add "conditions".
>
> Please be patient. I hear this over and over and over. Postgresql is
> faulty because it can't count fast.
> And the only reply I've seen are: it is bad design to use count,
> you're a dumb ass. I admit I may be a dumb ass, but it is hard to
> sell Postgres if I continue to be a dumb ass ;)
>
> - What count(*) really does?
> Now I understood that count(*) return the # of rows as it sees them
> at the moment it was invoked. That should be true for other DB as
> well. That means that unless other DB lock the table they can't take
> into account records that are going to be deleted once the count has
> been issued.
>
> - When count can be avoided?
> Well since even other DB may report the "wrong" number, this makes
> count() look less a Sacre Graal. But still if you need an estimate,
> wouldn't it be better to have it quickly?
> How does postgresql compare to other DB when you run:
> select count(*) from table where conditions
> once you use indexes?
> If such kind of query will have anyway to scan the results to see if
> they are still there since info about deleted records aren't stored
> in indexes, is there a way to ignore this and just have a faster
> estimate?
> I still can't see why it is bad design to use count().
>
> - When count() can't be avoided?
> All the situation where you may really need count() I think you also
> need to lock the table but well I'd be curious to see an example
> where you need count()
>
> Still can somebody make an example of bad design and one where
> count() couldn't be avoided if any?
>
> Consider that while it makes few sense to rely on "wrong" numbers in
> a "business" environment where data integrity/coherence makes *a lot*
> of sense it is not so clear in a CMS world where most of those
> critics come from.
>
> I know that the arguments to promote postgres in the "business" world
> where DB2, Oracle and MS SQL play (?) may be different and count()
> may lose its importance in that context and you could say that other
> advantages plenty pay off the "slowness" of an operation that in such
> a context is rare.
>
>
> thanks
>
>
> [1] or does postgres perform as the concurrence once you add where
> clauses?
>
>
>

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Guido Neitzer 2008-01-09 19:38:43 Re: Experiences with extensibility
Previous Message Ivan Sergio Borgonovo 2008-01-09 19:27:41 Re: count(*) and bad design was: Experiences with extensibility