Re: [SQL] `statistical' aggregate functions etc.

From: jwieck(at)debis(dot)com (Jan Wieck)
To: umehlig(at)uni-bremen(dot)de (Ulf Mehlig)
Cc: pgsql-sql(at)postgreSQL(dot)org
Subject: Re: [SQL] `statistical' aggregate functions etc.
Date: 1998-10-22 10:04:50
Message-ID: m0zWHbu-000EBPC@orion.SAPserv.Hamburg.dsh.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

> I'm a novice to postgresql, so please excuse if I'm addressing to the
> wrong mailing list!
>
> I need to specify some column functions for statistical analysis,
> e.g. standard deviation or variance. After reading the user's guideI
> think that's to realize with aggregates, but with those docs only I
> cannot figure out how to do it -- does anybody know whether there is a
> ready-for-use "statistics" package, or how to write something that's
> usable in a SQL statement like this:
>
> select date, avg(temperature), stddev(temperature)
> from temperatures
> where date between '01.01.1999' and '15.02.1999'
> group by date ?
>

The example below works and produces

select a, count(b) as n, average(b),
variance(b), stddev(b) from t1 group by a;
a| n|average| variance| stddev
-+--+-------+----------------+--------------
1| 5| 6.4| 0.8| 0.894427191
2|10| 3.53| 0.0401111111111|0.200277585144
3|10| 3.56|0.00488888888889|0.069920589878
(3 rows)

but it is slow. On my good old 486DX4 100MHz here in the
office it runs approx. 80 seconds if there are 6400 rows in
t1. If there are only a few rows to output, most of the time
is spent in the collector. So you would need at least to
create a new postgres basetype that holds two float8's
internal where the external representation is a Tcl list and
replace the stat_collect() function by a C language one. But
when already doing so, implementing the others in C too isn't
that much more work.

Doing it all in C would require 6 functions to implement.
The C equivalents of the four functions below plus the input
and output functions for the new data type.

Call me on +49 40 5395-1404 (office) if you have trouble or
don't know how to do it at all.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#======================================== jwieck(at)debis(dot)com (Jan Wieck) #

Statistical aggregate example in PL/Tcl:

create table t1 (a int4, b float8);

insert into t1 values (1, 5.0);
insert into t1 values (1, 6.0);
insert into t1 values (1, 7.0);
insert into t1 values (1, 7.0);
insert into t1 values (1, 7.0);

insert into t1 values (2, 3.3);
insert into t1 values (2, 3.5);
insert into t1 values (2, 3.7);
insert into t1 values (2, 3.2);
insert into t1 values (2, 3.6);
insert into t1 values (2, 3.5);
insert into t1 values (2, 3.6);
insert into t1 values (2, 3.4);
insert into t1 values (2, 3.6);
insert into t1 values (2, 3.9);

insert into t1 values (3, 3.5);
insert into t1 values (3, 3.6);
insert into t1 values (3, 3.6);
insert into t1 values (3, 3.7);
insert into t1 values (3, 3.5);
insert into t1 values (3, 3.6);
insert into t1 values (3, 3.5);
insert into t1 values (3, 3.5);
insert into t1 values (3, 3.6);
insert into t1 values (3, 3.5);

create function stat_collect (text, float8) returns text as '
if {[llength $1] == 0} {
set sum [expr double($2)]
set qsum [expr double($2) * double($2)]
} else {
set sum [expr [lindex $1 0] + double($2)]
set qsum [expr [lindex $1 1] + double($2) * double($2)]
}
return [list $sum $qsum]
' language 'pltcl';

create function stat_average (float8, int4) returns float8 as '
if {$2 == 0} {return 0.0}

return [expr double($1) / double($2)]
' language 'pltcl';

create function stat_variance (text, int4) returns float8 as '
if {$2 < 2} {return 0.0}

set n [expr double($2)]
set sum [lindex $1 0]
set qsum [lindex $1 1]

set average [expr $sum / $n]
set avgqsum [expr $average * $average * $n]
set variance [expr ($qsum - $avgqsum) / ($n - 1.0)]

return $variance
' language 'pltcl';

create function stat_stddev (text, int4) returns float8 as '
if {$2 < 2} {return 0.0}

set n [expr double($2)]
set sum [lindex $1 0]
set qsum [lindex $1 1]

set average [expr $sum / $n]
set avgqsum [expr $average * $average * $n]
set variance [expr ($qsum - $avgqsum) / ($n - 1.0)]
set stddev [expr sqrt($variance)]

return $stddev
' language 'pltcl';

create aggregate average (
basetype = float8,
stype1 = float8,
stype2 = int4,
sfunc1 = float8pl,
sfunc2 = int4inc,
finalfunc = stat_average,
initcond1 = '0.0',
initcond2 = '0'
);

create aggregate variance (
basetype = float8,
stype1 = text,
stype2 = int4,
sfunc1 = stat_collect,
sfunc2 = int4inc,
finalfunc = stat_variance,
initcond1 = '',
initcond2 = '0'
);

create aggregate stddev (
basetype = float8,
stype1 = text,
stype2 = int4,
sfunc1 = stat_collect,
sfunc2 = int4inc,
finalfunc = stat_stddev,
initcond1 = '',
initcond2 = '0'
);

select a, count(b) as n, average(b),
variance(b), stddev(b) from t1 group by a;

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Philippe Rochat (RSR: 318 17 93) 1998-10-22 10:37:35 Time format ? (Really microsecond ??)
Previous Message Aleksey Dashevsky 1998-10-22 09:45:50 Re: [SQL] `statistical' aggregate functions etc.