## Aggregate functions, fast! (long)

From: Ang Chin Han pgsql-sql(at)postgresql(dot)org Aggregate functions, fast! (long) 2000-08-09 06:53:45 20000809145345.B5894@pintoo.com (view raw or whole thread) 2000-08-09 06:53:45 from Ang Chin Han  2000-08-09 07:33:37 from Ang Chin Han pgsql-sql
```Apologies in advance for the length of this post, but this has
been bugging me for a week or so.

Consider a table with a primary key pk and a list of attributes a, b
and c:

Table t
pk  a  b  c
-----------
1  1  1  1
2  1  2  3
:   etc  :
9998  1  1  1
9999  2  1  2

If the table is a large but the range of possible values for the
attributes small, aggregate queries such as the following with be
very slow:

1.  select min(a) from t;
2.  select count(*) from t where a = 1 and b = 3;
3.  select sum(d) from t where a = 1 and c > 3;
4.  select avg(b) from t where c = 1;

One way of speeding these type of queries is to have a table
summarizing that table t:

Table t_sum
a  b  c    cnt
--------------
1  1  1     2  (This is from pk 1 and 9998 having the same a,b,c)
1  2  3     1
2  1  2     1
:  : etc    :
with a primary key of (a, b, c) and an integer cnt counting
the number of times a particular combination of (a, b, c)
occuring in table t

The queries making use of these might be rewritten as:
1.  select min(a) from t_sum; -- same as above,
-- but we've less rows to scan
2.  select cnt from t_sum where a = 1 and b = 3;
3.  select sum(d * cnt) as sum from t_sum where a = 1 and c > 3;
4.  select sum(b * cnt) / cnt as avg from t_sum where c = 1;

(CAVEAT/BUG: 1. must return cnt = 0 if it doesn't exist in
t_sum
2. rows with cnt = 0 will have to be deleted
immediately or select min(foo) might return the
wrong result)

Now, t_sum can be automatically updated by triggers/rules (I'll
get into this later)

It might seem a bit pointless but I've made the example very
generic to show that this _is_ a generic class of problem.
Specific examples might include:
select count(*) from books where category=x;
select count(*) from articles where category=x and author=y;

My point is it'll be nice if there is an easier mechanism ala
CREATE VIEW as a shortcut (and more!) for
'select x from y where z = a;'

The syntax might look like:
CREATE AGGREGATE INDEX t1_sum on t1 (a, b, c);
which would create the implicit triggers and table.

The hard part is for postgres to have a rule rewrite system
capable of converting the queries. (Perhaps we'll get this
when views-with-aggregate-columns bug is fixed?)

Of course, the performance gain can be achieved if you manually
aggregate index, similar to the normal index speeding up ranged
lookups).

And, oh, the rules/triggers: (I used these, and they work great
for some of the tests I did, but I haven't fully debugged these for
all cases, but they definately have bugs 1 & 2 described above.
)

---------- 8<- Cut here ---------
CREATE TABLE t_sum (
a INTEGER,
b INTEGER,
c INTEGER,
cnt INTEGER,
PRIMARY KEY (a, b, c)
);

CREATE RULE t_sum_add_rule AS ON INSERT TO t_sum WHERE
EXISTS (SELECT cnt FROM t_sum WHERE
a = new.a and b = new.b and c = new.c)
UPDATE t_sum set cnt = cnt + 1 WHERE
a = new.a and b = new.b and c = new.c;

CREATE RULE t_insert_rule AS ON INSERT TO t
DO
INSERT INTO t_sum values (new.a, new.b, new.c, 1);

CREATE FUNCTION "t_sum_upd" ( ) RETURNS opaque AS '
begin
update t_sum set cnt = cnt - 1
where t_sum.a = old.a
and t_sum.b = old.b
and t_sum.c = old.c;
insert into t_sum values (new.a, new.b, new.c);
return new;
end;' LANGUAGE 'plpgsql';

CREATE TRIGGER "t_upd" BEFORE UPDATE ON "t"
FOR EACH ROW EXECUTE PROCEDURE "t_sum_upd" ();

CREATE FUNCTION "t_sum_del" ( ) RETURNS opaque AS '
begin
update t_sum set cnt = cnt - 1
where t_sum.a = old.a
and t_sum.b = old.b
and t_sum.c = old.c;
return old;
end;' LANGUAGE 'plpgsql';

CREATE TRIGGER "t_del" BEFORE DELETE ON "t"
FOR EACH ROW EXECUTE PROCEDURE "t_sum_del" ();
---------- 8<- Cut here ---------

P.S. This post is inspired when someone mentioned on the list that
a separate counter might be kept by postgres to speed up some
aggregate functions like select count(*) from t;

P.P.S. Curious how do the commercial RDBMS handle this:
select count(*) from people where gender='m';
when people contains one million rows and gender distribution is
NEARLY 50% male/female?

```

### pgsql-sql by date

 Next: From: Ang Chin Han Date: 2000-08-09 07:33:37 Subject: Re: Aggregate functions, fast! (long) Previous: From: Ang Chin Han Date: 2000-08-09 06:50:29 Subject: Re: Functions too slow, even with iscachable?