Re: Implementing STDDEV and VARIANCE

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Jeroen van Vianen <jeroen(at)design(dot)nl>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Implementing STDDEV and VARIANCE
Date: 2000-09-30 02:27:04
Message-ID: 200009300227.WAA02615@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Jeroen, not sure if you were involved in this, but standard deviation is
in the current development tree, and will be released in 7.1 in a few
months.

> I'd like to implement stddev and variance aggregates in Postgres. This is a
> long standing TODO item.
>
> There already has been some discussion on implementing this, see
> http://www.postgresql.org/mhonarc/pgsql-hackers/1998-06/msg00175.html
>
> There are two definitions for standard deviation and variance:
> _
> population variance = sigma^2 = SUM(X - X)^2 / N
>
> population stddev = sqrt(population variance)
> _
> sample variance = s^2 = SUM(X - X)^2 / (N-1)
>
> sample stddev = sqrt(sample variance)
>
> These statistics can be calculated in one pass when three variables are
> kept when scanning through the rows, n, sum(x), sum(x^2). Currently, only
> two variables are kept.
>
> E.g. avg() is calculated as follows:
>
> sx = 0
> n = 0
> for every row {
> sx = sx + value in row // transition function 1
> n = n+1 // transition function 2
> }
> avg = sum(x) / n
>
> stddev / variance might be calculated as follows:
>
> sx = 0
> n = 0
> sx2 = 0
> for every row {
> sx = sx + value in row // transition function 1
> n = n+1 // transition function 2
> sx2 = sx2 + value in row^2 // transition function 3
> }
> var = (1/n) * (sx2 - (1/n) * sx^2) // Population
>
> or
>
> var = (1/(n-1)) * (sx2 - (1/n) * sx^2) // Sample
>
> and
>
> stddev = sqrt(var)
>
> I've looked through the code and the following things need to be implemented:
>
> 1. Add three columns to pg_aggregate for the additional third transition
> function.
>
> Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:
> >All that you need to implement this is room to keep two running
> >sums instead of one. I haven't looked at pgsql's aggregate functions,
> >but I'd hope that the working state can be a struct not just a
> >single number.
>
> I saw no other way than adding another transition function and logic, as
> this might break user-defined aggregates (are there any around?).
>
> 2. Add logic to nodeAgg.c to execute the third transition function and
> finalize function with three rather than two parameters
> 3. Add functions f(a,b) that returns a + b^2 for selected types
> 4. Add four finalize functions to calculate the variance / stddev
> 5. Update the code for create aggregate, to include the definition of the
> third transition function
> 6. Update the documentation
>
> My questions are:
> 1. Is this the correct way to continue? What am I missing? Any errors in my
> reasoning?
> 2. I am proposing the names stddev(x) and variance(x) for population and
> samplestddev(x) and
> samplevariance(x) for sample statistics. Any comments?
> 3. I'm planning to implement this for types float4, float8 and numeric. Any
> other types also? int[2,4,8] don't seem logical, as these would introduce
> serious rounding errors.
>
> Let me know what you think,
>
>
> Jeroen
>
> ************
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2000-09-30 02:28:53 7.1 beta schedule
Previous Message Bruce Momjian 2000-09-30 02:18:18 Re: uniqueness not always correct