From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Improving count(*)
Date: 2005-11-17 19:28:10
Message-ID: 1132255690.4959.207.camel@localhost.localdomain
Lists: pgsql-hackers

One of the major complaints is always "Select count(*) is slow".

I have a somewhat broad-brush idea as to how we might improve this (for
larger tables).

Previously, we've said that maintaining a running count of the tuples in
a table is hard, or at least costly. However, maintaining a partial
count may be somewhat easier and offer the performance required.

Bearing in mind that other RDBMSs' approach is to count the number of
rows in an index, their cost is, very roughly, about the same as scanning
a tenth of the table's blocks - so the cost is far from zero for them. In
order to get similar performance we would need a technique that involves
scanning only a small percentage of the data blocks in a table, typically
less than 10%.

Prelims: As tuples are inserted they go into blocks in a varied
sequence. However, since only VACUUM ever reduces the amount of data in a
block, blocks eventually fill up. Once a block is full it is removed from
the FSM, if it was listed there. At that point, we could begin keeping
track of the number of live tuples in the block.

So we could imagine an on-disk data structure that is an array of this
struct:
blockid - the block in question
count - the number of live tuples in that block
txnid - the txnid of the last tuple written
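
In C terms, a rough sketch of that struct might be the following (the
struct and field names are illustrative only, not a worked-out on-disk
format):

    /* one entry per "settled" heap block; types as in the usual backend headers */
    typedef struct BlockCountEntry
    {
        BlockNumber     blockid;    /* the block in question */
        uint16          count;      /* number of live tuples in that block */
        TransactionId   txnid;      /* xid of the last tuple written there */
    } BlockCountEntry;

A uint16 count is plenty, since an 8K heap block can hold at most a few
hundred tuples.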

When we run a SELECT COUNT(*) we would read this data structure and use
it to build a bitmap of blocks that still need to be scanned to get the
count(*) result. So this would give us a partially cached result for
count(*), which then would require scanning only the last few blocks of
a table to get at the correct answer.
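
As a very rough sketch of that (the cache lookup and per-block counting
functions below are invented purely for illustration, not real code):

    /* Hypothetical: count(*) driven by the per-block cache */
    int64
    cached_count(Relation rel, BlockCountEntry *cache, int ncached)
    {
        BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
        BlockNumber blkno;
        int64       total = 0;

        for (blkno = 0; blkno < nblocks; blkno++)
        {
            BlockCountEntry *entry = cache_lookup(cache, ncached, blkno);

            if (entry != NULL)
                total += entry->count;   /* settled block: use cached count */
            else
                total += count_live_tuples(rel, blkno);  /* still needs scanning */
        }

        return total;
    }

In practice the "else" blocks would be collected into a bitmap and
scanned together, as described above, rather than visited one by one.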

This is beginning to sound very much like the same sort of solution as
the one recently proposed for VACUUM: we keep track of the status of
each block, then use it as a way of speeding things up.

For VACUUM we want to keep a data structure that notes which blocks
require vacuuming. For count(*) we want to keep track, more or less, of
which blocks do not require vacuuming. Admittedly, some blocks in the
middle wouldn't go on either list.

So, why not combine the two purposes into a single solution?

We would be able to manage at least 800 data blocks for each block of
this structure, which would mean about 1.25 MB per GB of data. For tables
with a reasonably non-random block write pattern, the key parts of that
could reside in memory with reasonable ease even for busy tables. In
those cases, it also seems like this technique would require scanning
only a small percentage of heap blocks - so this would give roughly
equivalent performance to other RDBMSs for SELECT count(*). If it's a
data structure we were thinking of maintaining for VACUUM anyway, this
improves the return on the cost we would have to pay to maintain such a
cache.
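
To spell out the arithmetic behind the 800-blocks-per-block and 1.25 MB
figures (assuming roughly 10 bytes per entry and 8K blocks):

    8192 bytes per block / ~10 bytes per entry  =  ~800 entries per block
    1 GB of heap                                =  131072 blocks of 8K
    131072 / 800                                =  ~164 structure blocks, i.e. ~1.3 MB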

This is thinking only: I'm not putting this forward as a fully worked
proposal, nor am I pursuing it myself. I can see immediately that there
are some major obstacles in the way, like MVCC, autovacuum, overheads
etc., but it seems worth pointing out a possible angle of attack, since
this looks like it might kill two birds with one stone.

I'm not aiming to start "open season" on crazy ideas for this; this idea
may take months to ruminate into a solution.

Best Regards, Simon Riggs
