On 17 May 2005, at 17:15, Gordon Haverland wrote:
> On Tuesday 17 May 2005 09:46, you wrote:
>> I know people have asked about this before but I can't find a
>> working solution on the web - I've found code for specific
>> instances of calculating medians but I need a general aggregate
>> function for calculating medians, or more generally any given
>> The kind of thing I need to do is to be able to extract the
>> median value from a table of 4 million rows, aggregating across
>> more than 50,000 grouping values - the sort of thing that is
>> really easy to do for averaging:
>> SELECT grid_id, avg(rs) FROM behr_grid GROUP BY grid_id;
>> From what I've seen on the web, I know this isn't trivial or
>> necessarily fast but at the moment I'm reading the data out
>> into something that calculates medians and then matching it
>> back in and this doesn't feel particularly elegant!
> This is all from a numerical methods point of view, not a
> dbase/SQL point of view.
> Most of the quantile stuff I remember, at some point involves
> sorting and then binning. A person can sort all of the data
> points, and be sure to find whatever quantiles you are interested
> in (possibly using interpolation between 2 values which are close
> to the desired quantile).
> In your case, 4 million rows really isn't too many values, and so
> you probably could do this "in core". But, it may be you are
> headed towards larger problems, where the "out of core" methods
> might be more useful.
Actually the size of the table should be fairly static - there is the
possibility that it could increase to 16 million though!
> With the out of core methods, you are going to pass over the data
> more than once. There may be a special "first pass" in order to
> obtain some kind of estimate for the value you are looking for
> and a range to look over. If you are looking for a median and
> know the average (and hence presumably the standard deviation as
> well), you can look over a window centered on the mean with a
> width that is some fraction of the standard deviation (or
> standard error). The next pass over the data just counts how
> many values are below the lower limit, how many values are above
> the upper limit, and copies into core (or a temporary table) all
> the values within the window of interest. Then the windowed
> values are sorted (hopefully in core), and the desired quantile
> calculated. The smaller the window, the fewer values copied out,
> the faster the sort goes. But, smaller windows also mean greater
> chance that the quantile you are interested in isn't within the
> window range you are using. Which means at least one more pass
> through the data with new lower and upper limits.
Neat idea and I can certainly see the advantage. My worry is that the
median (and more generally any quantile) are usually applied to
variables that depart from the expectations of any particular
canonical distribution. This is certainly the case with my data, so
my feeling that a potentially slower sort/bin approach is safer.
> If you happen to know the underlying statistical distribution of
> the data, it may be that the functional form of that distribution
> can be calculated from a few statistics. For example: the
> Gaussian distribution can be calculated from the mean and
> variance. With a functional form, you may be able to
> analytically calculate whatever quantile is of interest. Of
> course, this presupposes that there are no outliers in your data.
That is at the root of the problem - I'm dealing with non-normal
variables (strongly right skewed in this instance) and so any
approach using a mean and standard deviation to approximate a
quantile is going to go badly astray!
Thanks for the response,
In response to
pgsql-novice by date
|Next:||From: Buddy Shearer||Date: 2005-05-17 16:45:16|
|Subject: Re: BAD SU operator at Postgresql startup (during boot) |
|Previous:||From: Adam Witney||Date: 2005-05-17 16:34:16|
|Subject: Re: Median/Quantile Aggregate|