Skip site navigation (1) Skip section navigation (2)

Re: wip: functions median and percentile

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Fetter <david(at)fetter(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: wip: functions median and percentile
Date: 2010-10-05 13:08:59
Message-ID: AANLkTi=HsAFyTusQD=VfFCMK2YAWNKVk7+cpi7S1=cFS@mail.gmail.com (view raw or flat)
Thread:
Lists: pgsql-hackerspgsql-rrreviewers
On 5 October 2010 13:14, Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com> wrote:
> 2010/10/5 Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>:
>> On 5 October 2010 07:04, Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com> wrote:
>> Extrapolating from few quick timing tests, even in the best case, on
>> my machine, it would take 7 days for the running median to use up
>> 100MB, and 8 years for it to use 2GB. So setting the tuplesort's
>> workMem to 2GB (only in the running median case) would actually be
>> safe in practice, and would prevent the temp file leak (for a few
>> years at least!). I feel dirty even suggesting that. Better ideas
>> anyone?
>
> So, I suggested to implement median as a *pure* window function aside
> from Pavel's aggregate function, and Greg suggested insertion
> capability of tuplesort. By this approach, we keep tuplesort to hold
> all the values in the current frame and can release it on the last of
> a partition (it's possible by window function API.) This is
> incremental addition of values and is far better than O(n^2 log(n))
> although I didn't estimate the order. Only when the frame head is
> moving down, we should re-initialize tuplesort and it is as slow as
> calling aggregate version per each row (but I think we can solve it
> somehow if looking precisely).
>

Possibly, but that sounds like a lot of work to get an efficient
algorithm. The 3 cases I see are:

1). Simple aggregate. Current algorithm is O(n log(n)) which is OK. It
could be better because a full sort is not strictly needed. As already
mentioned, a quickselect would be O(n).

2). Window without ORDER BY. This is actually basically the same as
(1), but called once per partition.

3). Window with ORDER BY (running median). The simplest algorithm is
O(n^2 log(n)). It could be tweaked to use an insertion sort, but that
would still be O(n^2), which is not a lot better for all the effort
that would be involved. In theory (perhaps with some kind of tree) it
ought to be possible to come up with an O(n log(n)) algorithm, but
that would be a lot of work.

In the meantime, the attached variation of the patch fixes the temp
file issue and will support all 3 cases. It gives OK performance for
(1) and (2), and poor performance for (3). That could be viewed as a
future development task, which perhaps the window function API would
help with. I think it would be a shame to drop support for (2) just
because we can't do (3) efficiently yet.

Regards,
Dean


> Regards,
>
> --
> Hitoshi Harada
>

Attachment: median3.diff
Description: text/x-patch (21.3 KB)

In response to

Responses

pgsql-hackers by date

Next:From: Robert HaasDate: 2010-10-05 13:12:05
Subject: Re: Insertion of geometric type column with column[0], column[1] and etc.
Previous:From: Stephen FrostDate: 2010-10-05 13:04:55
Subject: Re: Insertion of geometric type column with column[0],column[1] and etc.

pgsql-rrreviewers by date

Next:From: Tom LaneDate: 2010-10-10 21:16:59
Subject: Re: wip: functions median and percentile
Previous:From: Hitoshi HaradaDate: 2010-10-05 12:14:24
Subject: Re: wip: functions median and percentile

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group