Re: Implementing STDDEV and VARIANCE

From: JanWieck(at)t-online(dot)de (Jan Wieck)
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Jeroen van Vianen <jeroen(at)design(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Implementing STDDEV and VARIANCE
Date: 2000-06-09 10:39:51
Message-ID: 200006091039.MAA04186@hot.jw.home
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Bruce Momjian wrote:

I created all that sometimes back. Dunno why never added it
to contrib. Will post it another day.

Jan

> Has this gone anywhere?
>
> > I'd like to implement stddev and variance aggregates in Postgres. This is a
> > long standing TODO item.
> >
> > There already has been some discussion on implementing this, see
> > http://www.postgresql.org/mhonarc/pgsql-hackers/1998-06/msg00175.html
> >
> > There are two definitions for standard deviation and variance:
> > _
> > population variance = sigma^2 = SUM(X - X)^2 / N
> >
> > population stddev = sqrt(population variance)
> > _
> > sample variance = s^2 = SUM(X - X)^2 / (N-1)
> >
> > sample stddev = sqrt(sample variance)
> >
> > These statistics can be calculated in one pass when three variables are
> > kept when scanning through the rows, n, sum(x), sum(x^2). Currently, only
> > two variables are kept.
> >
> > E.g. avg() is calculated as follows:
> >
> > sx = 0
> > n = 0
> > for every row {
> > sx = sx + value in row // transition function 1
> > n = n+1 // transition function 2
> > }
> > avg = sum(x) / n
> >
> > stddev / variance might be calculated as follows:
> >
> > sx = 0
> > n = 0
> > sx2 = 0
> > for every row {
> > sx = sx + value in row // transition function 1
> > n = n+1 // transition function 2
> > sx2 = sx2 + value in row^2 // transition function 3
> > }
> > var = (1/n) * (sx2 - (1/n) * sx^2) // Population
> >
> > or
> >
> > var = (1/(n-1)) * (sx2 - (1/n) * sx^2) // Sample
> >
> > and
> >
> > stddev = sqrt(var)
> >
> > I've looked through the code and the following things need to be implemented:
> >
> > 1. Add three columns to pg_aggregate for the additional third transition
> > function.
> >
> > Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:
> > >All that you need to implement this is room to keep two running
> > >sums instead of one. I haven't looked at pgsql's aggregate functions,
> > >but I'd hope that the working state can be a struct not just a
> > >single number.
> >
> > I saw no other way than adding another transition function and logic, as
> > this might break user-defined aggregates (are there any around?).
> >
> > 2. Add logic to nodeAgg.c to execute the third transition function and
> > finalize function with three rather than two parameters
> > 3. Add functions f(a,b) that returns a + b^2 for selected types
> > 4. Add four finalize functions to calculate the variance / stddev
> > 5. Update the code for create aggregate, to include the definition of the
> > third transition function
> > 6. Update the documentation
> >
> > My questions are:
> > 1. Is this the correct way to continue? What am I missing? Any errors in my
> > reasoning?
> > 2. I am proposing the names stddev(x) and variance(x) for population and
> > samplestddev(x) and
> > samplevariance(x) for sample statistics. Any comments?
> > 3. I'm planning to implement this for types float4, float8 and numeric. Any
> > other types also? int[2,4,8] don't seem logical, as these would introduce
> > serious rounding errors.
> >
> > Let me know what you think,
> >
> >
> > Jeroen
> >
> > ************
> >
>
>
> --
> Bruce Momjian | http://www.op.net/~candle
> pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
> + If your life is a hard drive, | 830 Blythe Avenue
> + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
>

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Jan Wieck 2000-06-09 11:04:27 Re: AW: Proposal: TRUNCATE TABLE table RESTRICT
Previous Message Bruce Momjian 2000-06-09 10:32:06 Re: freefuncs.c is never called from anywhere!?