Re: [HACKERS] Standard Deviation function.

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Standard Deviation function.
Date: 1998-06-05 15:24:04
Message-ID: 28715.897060244@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Andreas Zeugswetter <andreas(dot)zeugswetter(at)telecom(dot)at> writes:
> Wow, this is it. But as I said, the above line is wrong (By the way:
> this is a very common mistake).
> It should read:
> $self->{standard_deviation} = sqrt( $self->{pseudo_variance} / $self->{count} )
> Note: The - 1 is missing

The formula with N-1 in the divisor is correct for the "sample standard
deviation". That is what you use when your N data points represent a
sample from a larger population, and you want to estimate the standard
deviation of the whole population.

If your N data points in fact are the *whole* population of interest,
then you calculate the "population standard deviation" which has just N
in the divisor. So both versions of the formula are correct depending
on the situation, and you really ought to provide both.

(To justify the difference intuitively: if you have exactly one data
point, and it is the *whole* population, then the mean equals the
data value and the standard deviation is zero. That is what you get
with N in the divisor. But if your one data point is a sample from
a larger population, you cannot estimate the population's standard
deviation; you need more data. The N-1 equation gives 0/0 in this
case, correctly signifying that the value is indeterminate.)

I think the Perl code given earlier in the thread pretty much sucks
from a numerical accuracy point of view. The running mean calculation
suffers from accumulation of errors, and that propagates into the
pseudo-variance in a big way. It's particularly bad if the data is
tightly clustered about the mean; the code ends up doing lots of
subtractions of nearly equal values.

The accepted way to do sample standard deviation in one pass is this:

STDDEV = SQRT( (N*SIGMA(Xi^2) - SIGMA(Xi)^2) / (N*(N-1)) )

where N is the number of data points and SIGMA(Xi) means the sum
of the data values Xi. You keep running sums of Xi and Xi^2 as
you pass over the data, then you apply the above equation once
at the end. (For population standard deviation, you use N^2 as
the denominator. For variance, you just leave off the SQRT().)

All that you need to implement this is room to keep two running
sums instead of one. I haven't looked at pgsql's aggregate functions,
but I'd hope that the working state can be a struct not just a
single number.

regards, tom lane

Browse pgsql-hackers by date

  From Date Subject
Next Message The Hermit Hacker 1998-06-05 16:08:32 Re: [HACKERS] NEW POSTGRESQL LOGOS
Previous Message Matthew N. Dodd 1998-06-05 15:23:03 Re: [HACKERS] NEW POSTGRESQL LOGOS