finding medians

From: Josh Burdick <jburdick(at)gradient(dot)cis(dot)upenn(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Cc: josh(at)agliodbs(dot)com
Subject: finding medians
Date: 2002-05-30 20:16:43
Message-ID: 3CF688AB.4010005@gradient.cis.upenn.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At the end of this message is some code I used to find medians.
It's kind of a hack, but approximately works, and is intended as a
somewhat awkward stopgap for people who need to use medians. It
illustrates the limitations of the current aggregate function setup,
which works so nicely for avg() and stddev().
I don't have any good solutions. I tried using a float4[] to store
each element as it's added, but I couldn't get array updates working in
PL/PgSQL, so that didn't help.
Perhaps aggregate functions could be passed an array? Or a cursor,
pointing at the first line? I'm not sure.

Anyways, perhaps it'll be helpful.
Josh

--
Josh Burdick
jburdick(at)gradient(dot)cis(dot)upenn(dot)edu
http://www.cis.upenn.edu/~jburdick

/* Implementing median-finding in "pure Postgres." Does this by
copying data to a temporary table.

A weakness of this code is that it uses sorting, instead of
Hoare's linear-time median algorithm. Presumably sorting is
implemented so efficiently that it'll be faster than anything
written in PL/PgSQL. (Although Hoare's algorithm implemented
in C would be faster than either.)

BUG: this isn't properly set up to deal with multiple users.
For example, if A computes a median, then B could read the data
from the median_tmp table. Possibly you could fiddle with
transaction isolation levels, or add a user field to median_tmp,
or something else complicated, to prevent this, but for now I'm
not worrying about this.

Written by Josh Burdick (jburdick(at)gradient(dot)cis(dot)upenn(dot)edu).
Anyone can use this under the same license as Postgres.

20020524, jtb: started. */

drop aggregate median(float4);
drop table median_tmp;
drop sequence median_id;
drop index median_tmp_median_id;
drop function median_sfunc_float4(bigint, float4);
drop function median_finalfunc_float4(bigint);

create sequence median_id;
create table median_tmp (
median_id int,
x float4
);
create index median_tmp_median_id on median_tmp(median_id);

create function median_sfunc_float4
(bigint, float4) returns bigint as '

insert into median_tmp
values (case when $1 = 0 then nextval(''median_id'') else $1 end, $2);

select currval(''median_id'');

' language 'SQL';

create function median_finalfunc_float4
(bigint) returns float4 as '
declare

i bigint;
n bigint;
c refcursor;
m float4;
m1 float4;

begin

n := (select count(*) from median_tmp where median_id = $1);

open c for select x from median_tmp where median_id = $1 order by x;

for i in 1..((n+1)/2) loop
fetch c into m;
end loop;

/* if n is even, fetch the next value, and average the two */
if (n % int8(2) = int8(0)) then
fetch c into m1;
m := (m + m1) / 2;
end if;

delete from median_tmp where median_id = $1;

return m;

end
' language 'plpgsql';

create aggregate median (
basetype = float4,
stype = bigint,
initcond = 0,
sfunc = median_sfunc_float4,
finalfunc = median_finalfunc_float4
);

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Josh Berkus 2002-05-30 20:30:09 Re: finding medians
Previous Message Hannu Krosing 2002-05-30 20:16:41 Re: finding medians