Re: Select max(foo) and select count(*) optimization

From: Christopher Browne <cbbrowne(at)acm(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Select max(foo) and select count(*) optimization
Date: 2004-01-06 02:31:29
Message-ID: m3r7yddczi.fsf@wolfe.cbbrowne.com
Lists: pgsql-performance

pg(at)rbt(dot)ca (Rod Taylor) wrote:
>> Especially with very large tables, hearing the disks grind as Postgres scans
>> every single row in order to determine the number of rows in a table or the
>> max value of a column (even a primary key created from a sequence) is pretty
>> painful. If the implementation is not too horrendous, this is an area where
>> an order-of-magnitude performance increase can be had.
>
> Actually, it's very painful. For MySQL, they've accepted the concurrency
> hit in order to accomplish it -- PostgreSQL would require a more subtle
> approach.
>
> Anyway, with Rules you can force this:
>
> ON INSERT UPDATE counter SET tablecount = tablecount + 1;
>
> ON DELETE UPDATE counter SET tablecount = tablecount - 1;
>
> You need to create a table "counter" with a single row that will keep
> track of the number of rows in the table. Just remember, you've now
> serialized all writes to the table, but in your situation it may be
> worthwhile.
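
In concrete PostgreSQL syntax, that might look something like the
following sketch ("our_table" stands in for the real table, and the
rule names are mine):

  create table counter (tablecount bigint not null);
  insert into counter values (0);

  -- every insert/delete on our_table also bumps the one counter row
  create rule our_table_count_ins as on insert to our_table
      do update counter set tablecount = tablecount + 1;

  create rule our_table_count_del as on delete to our_table
      do update counter set tablecount = tablecount - 1;

One caveat: a rule's extra action runs once per _statement_, not once
per row, so a multi-row INSERT ... SELECT or DELETE would be
miscounted here; a FOR EACH ROW trigger sidesteps that.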

There's a still more subtle approach that relieves the serialization
constraint, at some cost...

- You add rules that _insert_ a delta row each time there is an
insert/delete (full syntax is sketched after this list):
ON INSERT insert into counts(table, value) values ('our_table', 1);
ON DELETE insert into counts(table, value) values ('our_table', -1);

- The "select count(*) from our_table" is replaced by "select
sum(value) from counts where table = 'our_table'"

- Periodically, a "compression" process goes through and does one of
the following:

a) Deletes the rows for 'our_table' and replaces them with one
row with a conventionally-scanned 'count(*)' value, or

b) Computes "select table, sum(value) as value from counts group
by table", deletes all the existing rows in counts, and replaces
them with the preceding selection (sketched at the end of this
message), or

c) Perhaps does something incremental that's like b), but which
only processes parts of the "counts" table at once. Process
500 rows, then COMMIT, or something of the sort...
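
In PostgreSQL syntax, the pieces above might be sketched like so
(names are illustrative, and since TABLE is a reserved word I've
called the column "tablename" rather than "table"):

  create table counts (
      tablename text   not null,
      value     bigint not null
  );
  create index counts_tablename_idx on counts (tablename);

  -- each insert/delete on our_table logs a +1/-1 delta row
  create rule our_table_cnt_ins as on insert to our_table
      do insert into counts values ('our_table', 1);

  create rule our_table_cnt_del as on delete to our_table
      do insert into counts values ('our_table', -1);

  -- the replacement for "select count(*) from our_table":
  select sum(value) from counts where tablename = 'our_table';

(The same per-statement rule caveat as above applies here.)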

Note that this "counts" table can potentially grow _extremely_ large.
The "win" comes when it gets compressed, so that instead of scanning
through 500K items, it index-scans through 27 rows: the one holding
the "497000" that was the state of the table at the last compression,
plus 26 singleton deltas added since then.

A further win comes if an INSERT that adds 50 rows can lead to
inserting ('our_table', 50) into COUNTS, or a DELETE that eliminates
5000 rows puts in ('our_table', -5000).

It's vital to run the "compression" reasonably often (much like VACUUM
:-)) in order that the COUNTS summary table stays relatively small.
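
Variant b) of that compression might be sketched as a single
transaction along these lines (illustrative, using the same schema as
above; the explicit lock briefly blocks writers, though plain SELECTs
can still proceed):

  begin;
  -- keep new deltas from slipping in between the snapshot and the delete
  lock table counts in exclusive mode;
  create temp table counts_summary as
      select tablename, sum(value) as value
        from counts
       group by tablename;
  delete from counts;
  insert into counts select tablename, value from counts_summary;
  drop table counts_summary;
  commit;
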
--
let name="cbbrowne" and tld="cbbrowne.com" in String.concat "@" [name;tld];;
http://www3.sympatico.ca/cbbrowne/wp.html
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by
definition, not smart enough to debug it. -- Brian W. Kernighan
