Re: Median/Quantile Aggregate

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Orme <d(dot)orme(at)imperial(dot)ac(dot)uk>
Cc: pgsql-novice(at)postgresql(dot)org
Subject: Re: Median/Quantile Aggregate
Date: 2005-05-17 16:31:36
Message-ID: 29177.1116347496@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

David Orme <d(dot)orme(at)imperial(dot)ac(dot)uk> writes:
> I know people have asked about this before but I can't find a working
> solution on the web - I've found code for specific instances of
> calculating medians but I need a general aggregate function for
> calculating medians, or more generally any given quantile.

> The kind of thing I need to do is to be able to extract the median
> value from a table of 4 million rows, aggregating across more than
> 50,000 grouping values - the sort of thing that is really easy to do
> for averaging:

> SELECT grid_id, avg(rs) FROM behr_grid GROUP BY grid_id;

You could build a custom aggregate for this. array_append()
will do fine as the transition function, so all you really need to
write is a final function that sorts the given array and then picks
out the middle (or appropriate-quantile) element.

In fact, you could cheat a bit and let the system do the sorting for
you:

SELECT grid_id, myagg(rs)
FROM (SELECT grid_id, rs FROM behr_grid ORDER BY grid_id, rs) ss
GROUP BY grid_id;

If the aggregate is only used in a context like this, it will always
see presorted input and so it can just pull out the middle element.
(Note: I think this trick only works in PG 7.4 and later.)

So, lightly tested:

regression=# create function get_middle(anyarray) returns anyelement as
regression-# 'declare n integer;
regression'# begin
regression'# n := (array_lower($1, 1) + array_upper($1, 1)) / 2;
regression'# return $1[n];
regression'# end' language plpgsql;
CREATE FUNCTION
regression=# create aggregate sortedmedian(
regression(# sfunc = array_append,
regression(# finalfunc = get_middle,
regression(# basetype = anyelement,
regression(# stype = anyarray,
regression(# initcond = '{}'
regression(# );
CREATE AGGREGATE
regression=# select hundred, min(thousand), max(thousand), sortedmedian(thousand) from
regression-# (select hundred, thousand from tenk1 order by 1,2) ss
regression-# group by hundred;
hundred | min | max | sortedmedian
---------+-----+-----+--------------
0 | 0 | 900 | 400
1 | 1 | 901 | 401
2 | 2 | 902 | 402
3 | 3 | 903 | 403
4 | 4 | 904 | 404
5 | 5 | 905 | 405
6 | 6 | 906 | 406
7 | 7 | 907 | 407
8 | 8 | 908 | 408
9 | 9 | 909 | 409
10 | 10 | 910 | 410
...

regards, tom lane

In response to

Browse pgsql-novice by date

  From Date Subject
Next Message Adam Witney 2005-05-17 16:34:16 Re: Median/Quantile Aggregate
Previous Message Sean Davis 2005-05-17 16:17:07 Re: Median/Quantile Aggregate