Re: stddev returns 0 when there is one row

From: Murthy Kambhampaty <murthy(dot)kambhampaty(at)goeci(dot)com>
To: 'Tom Lane' <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Joe Conway <mail(at)joeconway(dot)com>
Cc: Manfred Koizar <mkoi-pg(at)aon(dot)at>, Bruno Wolff III <bruno(at)wolff(dot)to>, pgsql-general(at)postgresql(dot)org
Subject: Re: stddev returns 0 when there is one row
Date: 2003-04-21 16:05:12
Message-ID: 2D92FEBFD3BE1346A6C397223A8DD3FC0921F7@THOR.goeci.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

If we could step back for second and look at why you would want to calculate
a standard deviation in the first place -- i.e., to measure the "spread", or
"dispersion", in the measure represented by a given field, we might get a
better answer than looking at the specs. for somebody else' software.

If you were looking at a table of sample data, and you had a sample size of
one, then the result should, of course be null, since a sample size of one
gives you no information about the variance and standard deviation (standard
error, really) of the distribution from which the sample was drawn.

But we are talking about a relational database here, where it is common to
throw away redundant information, and multiple draws from a deterministic
process, with a variance and standard deviation of zero, are redundant (did
you ever do a "select distinct from <>"?). So, it is entirely likely that
data stored in postgresql will only have include row for data drawn from a
"distribution with zero standard deviation".

The STDEV_POP and STDEV_SMPL convention is not strictly right for the
context, since we throw away "redundant" records regardless of whether they
represent a sample or a population; so it's no surprise that postgresql is
programmed the way it is: if your single record may not really represent a
single draw on a distribution, so it's the user's job to program your SQL to
return NULL for std. dev. and variance.

Of course, multiple identical records aren't redundant when they describe
statistical data, so you should be keeping a count() column around to record
the "frequency" with which each distinct set of values occurs, and using the
variance and standard deviation formulas for "grouped data" - e.g., ( sum(
X**2 * freq) - sum( X *freq)*sum( X*freq))/(sum( freq) -1) for the sample
variance, where the column "X" records the data values, and the column
"freq" records the number of times the values is repeated in the sample.
Tom, would it be too much trouble to program these in (as STDEV_POP_G and
STDEV_SMPL_G or some variant?

Thanks,
Murthy

-----Original Message-----
From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
Sent: Saturday, April 19, 2003 16:33
To: Joe Conway
Cc: Manfred Koizar; Bruno Wolff III; pgsql-general(at)postgresql(dot)org
Subject: Re: [GENERAL] stddev returns 0 when there is one row

Joe Conway <mail(at)joeconway(dot)com> writes:
> So I'd take it that PostgreSQL's STDDEV implements STDDEV_POP.

No, we implement the sample standard deviation, as stated in the docs:
http://www.ca.postgresql.org/users-lounge/docs/7.3/postgres/functions-aggreg
ate.html

The code is pretty straightforward, at least in the float8 case:

/* We define STDDEV of no values to be NULL, of 1 value to be 0 */
if (N == 0.0)
PG_RETURN_NULL();

if (N <= 1.0)
PG_RETURN_FLOAT8(0.0);

numerator = N * sumX2 - sumX * sumX;

/* Watch out for roundoff error producing a negative numerator */
if (numerator <= 0.0)
PG_RETURN_FLOAT8(0.0);

PG_RETURN_FLOAT8(sqrt(numerator / (N * (N - 1.0))));

I don't have a real strong feeling about whether we should change the
behavior at N=1 or not. Does the SQL200x spec provide any guidance?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Josh Berkus 2003-04-21 16:14:43 Re: [SQL] Yet Another (Simple) Case of Index not used
Previous Message Joe Conway 2003-04-21 16:01:24 Re: Using TEXT columns for binary content