Skip site navigation (1) Skip section navigation (2)

Re: wip: functions median and percentile

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, David Fetter <david(at)fetter(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: wip: functions median and percentile
Date: 2010-10-10 21:16:59
Message-ID: 17482.1286745419@sss.pgh.pa.us (view raw or flat)
Thread:
Lists: pgsql-hackerspgsql-rrreviewers
Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> writes:
> In the meantime, the attached variation of the patch fixes the temp
> file issue and will support all 3 cases. It gives OK performance for
> (1) and (2), and poor performance for (3). That could be viewed as a
> future development task, which perhaps the window function API would
> help with. I think it would be a shame to drop support for (2) just
> because we can't do (3) efficiently yet.

I started looking at this patch, and noticed that we got so caught up
in implementation issues that we forgot the unresolved problem of data
types.  The original patch defined median(anyelement), which is only
well-defined for an odd number of inputs; for an even number of inputs
you have to take the left or right item in the central pair.  I see
that this version defines
	median(float8) returns float8
	median(float4) returns float8
	median(numeric) returns numeric
	median(int8) returns numeric
	median(int4) returns numeric
	median(int2) returns numeric
which strikes me as possibly being overkill.

It was pointed out upthread that while median isn't presently
in the standard, Oracle defines it in terms of percentile_cont(0.5)
which *is* in the standard.  What I read in SQL:2008 is that
percentile_cont is defined for all numeric types (returning
approximate numeric with implementation-defined precision),
and for interval (returning interval), and not for any other
input type.  So it appears to me that what we ought to support
is
	median(float8) returns float8
	median(interval) returns interval
and nothing else --- we can rely on implicit casting to convert
any other numeric input type to float8.

BTW, as far as the implementation issues go, telling tuplesort that it
can use gigabytes of memory no matter what seems quite unacceptable.
Put this thing into a hash aggregation and you'll blow out your memory
in no time.  I don't think it's even a good idea to use work_mem there.
I wonder whether it'd be a good idea to augment AggCheckCallContext()
so that there's a way for aggregates to find out how much memory they
ought to try to use.  In a simple aggregation situation it's probably
OK to use work_mem, but in a hash aggregation you'd better use less
--- perhaps work_mem divided by the number of groups expected.

Also, I believe that the lack-of-cleanup problem for tuplesorts spilling
to disk should be fixable by using an exprcontext shutdown callback (see
RegisterExprContextCallback).

Comments?

			regards, tom lane

In response to

Responses

pgsql-hackers by date

Next:From: Heikki LinnakangasDate: 2010-10-10 21:24:17
Subject: Re: Debugging initdb breakage
Previous:From: Dimitri FontaineDate: 2010-10-10 20:38:01
Subject: Debugging initdb breakage

pgsql-rrreviewers by date

Next:From: Pavel StehuleDate: 2010-10-11 06:58:53
Subject: Re: wip: functions median and percentile
Previous:From: Dean RasheedDate: 2010-10-05 13:08:59
Subject: Re: wip: functions median and percentile

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group