Re: how to identify outliers

From: Scott Bailey <artacus(at)comcast(dot)net>
To:
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: how to identify outliers
Date: 2009-10-27 23:04:47
Message-ID: 4AE77C8F.1000900@comcast.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Rhys A.D. Stewart wrote:
> Im asking how to get the ones that dont fall near the avg.... so for
> example lets say i have the following distances:
> 10,11,12,11,10,9,9,10,11,12,10,11,99
>
> then 99 would be an outlier. the avg would be like 16 or 17 i reckon
> with the 99. so i want a way to find aan outlier, remove it and then
> recalcuate the avg...and then i'd get a 'better' avg.....
>
> i did some seraching about outliers and most of hits are about R or
> spss or some other statistical package.....so looking for a way to do
> it wholly in pgsql.
>
>
> Rhys
>
> On Tue, Oct 27, 2009 at 4:53 PM, Ben Chobot <bench(at)silentmedia(dot)com> wrote:
>> Are you asking how to find the average and standard deviation? Or how to
>> compare the your data against some set values? Perhaps an example would be
>> appropriate; it's not very clear to me what you're asking.
>>
>> Rhys A.D. Stewart wrote:
>>> Hey all,
>>> I have the following table: data(pnum text, distance float8, route text).
>>> I would like to remove the outliers in distance, i.e. lets say i get
>>> the avg dist of pnum for each route and the std deviation of the
>>> distance what is the best way to identify the outliers?
>>>
>>>
>>> Rhys.

Oh, so you want to "cook" your data? I don't agree with that
conceptually, but:

WITH base AS (
SELECT random(1, 100)::int AS i
FROM generate_series(1, 100) i
),
stats AS (
SELECT avg(i) AS dist_avg, stddev(i) AS dist_dev FROM base
)
SELECT count(i), avg(i) AS new_avg, MIN(stats.dist_avg) AS old_avg,
stddev(i) AS new_dev, MIN(stats.dist_dev) AS old_dev
FROM base, stats
WHERE base.i BETWEEN stats.dist_avg - dist_dev
AND stats.dist_avg + dist_dev

Scott Bailey

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Merlin Moncure 2009-10-27 23:18:22 Re: Slow running query with views...how to increase efficiency? with index?
Previous Message Alvaro Herrera 2009-10-27 22:56:04 Re: how to identify outliers