Aggregate functions, fast! (long)

From: Ang Chin Han <angch(at)pintoo(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: Aggregate functions, fast! (long)
Date: 2000-08-09 06:53:45
Message-ID: 20000809145345.B5894@pintoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Apologies in advance for the length of this post, but this has
been bugging me for a week or so.

Consider a table with a primary key pk and a list of attributes a, b
and c:

Table t
pk a b c
-----------
1 1 1 1
2 1 2 3
: etc :
9998 1 1 1
9999 2 1 2

If the table is a large but the range of possible values for the
attributes small, aggregate queries such as the following with be
very slow:

1. select min(a) from t;
2. select count(*) from t where a = 1 and b = 3;
3. select sum(d) from t where a = 1 and c > 3;
4. select avg(b) from t where c = 1;

One way of speeding these type of queries is to have a table
summarizing that table t:

Table t_sum
a b c cnt
--------------
1 1 1 2 (This is from pk 1 and 9998 having the same a,b,c)
1 2 3 1
2 1 2 1
: : etc :
with a primary key of (a, b, c) and an integer cnt counting
the number of times a particular combination of (a, b, c)
occuring in table t

The queries making use of these might be rewritten as:
1. select min(a) from t_sum; -- same as above,
-- but we've less rows to scan
2. select cnt from t_sum where a = 1 and b = 3;
3. select sum(d * cnt) as sum from t_sum where a = 1 and c > 3;
4. select sum(b * cnt) / cnt as avg from t_sum where c = 1;

(CAVEAT/BUG: 1. must return cnt = 0 if it doesn't exist in
t_sum
2. rows with cnt = 0 will have to be deleted
immediately or select min(foo) might return the
wrong result)

Now, t_sum can be automatically updated by triggers/rules (I'll
get into this later)

It might seem a bit pointless but I've made the example very
generic to show that this _is_ a generic class of problem.
Specific examples might include:
select count(*) from books where category=x;
select count(*) from articles where category=x and author=y;

My point is it'll be nice if there is an easier mechanism ala
CREATE VIEW as a shortcut (and more!) for
'select x from y where z = a;'

The syntax might look like:
CREATE AGGREGATE INDEX t1_sum on t1 (a, b, c);
which would create the implicit triggers and table.

The hard part is for postgres to have a rule rewrite system
capable of converting the queries. (Perhaps we'll get this
when views-with-aggregate-columns bug is fixed?)

Of course, the performance gain can be achieved if you manually
rewrite your queries to take advantage of the summary table (or
aggregate index, similar to the normal index speeding up ranged
lookups).

And, oh, the rules/triggers: (I used these, and they work great
for some of the tests I did, but I haven't fully debugged these for
all cases, but they definately have bugs 1 & 2 described above.
)

---------- 8<- Cut here ---------
CREATE TABLE t_sum (
a INTEGER,
b INTEGER,
c INTEGER,
cnt INTEGER,
PRIMARY KEY (a, b, c)
);

CREATE RULE t_sum_add_rule AS ON INSERT TO t_sum WHERE
EXISTS (SELECT cnt FROM t_sum WHERE
a = new.a and b = new.b and c = new.c)
DO INSTEAD
UPDATE t_sum set cnt = cnt + 1 WHERE
a = new.a and b = new.b and c = new.c;

CREATE RULE t_insert_rule AS ON INSERT TO t
DO
INSERT INTO t_sum values (new.a, new.b, new.c, 1);

CREATE FUNCTION "t_sum_upd" ( ) RETURNS opaque AS '
begin
update t_sum set cnt = cnt - 1
where t_sum.a = old.a
and t_sum.b = old.b
and t_sum.c = old.c;
insert into t_sum values (new.a, new.b, new.c);
return new;
end;' LANGUAGE 'plpgsql';

CREATE TRIGGER "t_upd" BEFORE UPDATE ON "t"
FOR EACH ROW EXECUTE PROCEDURE "t_sum_upd" ();

CREATE FUNCTION "t_sum_del" ( ) RETURNS opaque AS '
begin
update t_sum set cnt = cnt - 1
where t_sum.a = old.a
and t_sum.b = old.b
and t_sum.c = old.c;
return old;
end;' LANGUAGE 'plpgsql';

CREATE TRIGGER "t_del" BEFORE DELETE ON "t"
FOR EACH ROW EXECUTE PROCEDURE "t_sum_del" ();
---------- 8<- Cut here ---------

P.S. This post is inspired when someone mentioned on the list that
a separate counter might be kept by postgres to speed up some
aggregate functions like select count(*) from t;

P.P.S. Curious how do the commercial RDBMS handle this:
select count(*) from people where gender='m';
when people contains one million rows and gender distribution is
NEARLY 50% male/female?

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Ang Chin Han 2000-08-09 07:33:37 Re: Aggregate functions, fast! (long)
Previous Message Ang Chin Han 2000-08-09 06:50:29 Re: Functions too slow, even with iscachable?