Re: wip: functions median and percentile

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Fetter <david(at)fetter(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: wip: functions median and percentile
Date: 2010-10-05 13:08:59
Message-ID: AANLkTi=HsAFyTusQD=VfFCMK2YAWNKVk7+cpi7S1=cFS@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-rrreviewers

On 5 October 2010 13:14, Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com> wrote:
> 2010/10/5 Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>:
>> On 5 October 2010 07:04, Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com> wrote:
>> Extrapolating from few quick timing tests, even in the best case, on
>> my machine, it would take 7 days for the running median to use up
>> 100MB, and 8 years for it to use 2GB. So setting the tuplesort's
>> workMem to 2GB (only in the running median case) would actually be
>> safe in practice, and would prevent the temp file leak (for a few
>> years at least!). I feel dirty even suggesting that. Better ideas
>> anyone?
>
> So, I suggested to implement median as a *pure* window function aside
> from Pavel's aggregate function, and Greg suggested insertion
> capability of tuplesort. By this approach, we keep tuplesort to hold
> all the values in the current frame and can release it on the last of
> a partition (it's possible by window function API.) This is
> incremental addition of values and is far better than O(n^2 log(n))
> although I didn't estimate the order. Only when the frame head is
> moving down, we should re-initialize tuplesort and it is as slow as
> calling aggregate version per each row (but I think we can solve it
> somehow if looking precisely).
>

Possibly, but that sounds like a lot of work to get an efficient
algorithm. The 3 cases I see are:

1). Simple aggregate. Current algorithm is O(n log(n)) which is OK. It
could be better because a full sort is not strictly needed. As already
mentioned, a quickselect would be O(n).

2). Window without ORDER BY. This is actually basically the same as
(1), but called once per partition.

3). Window with ORDER BY (running median). The simplest algorithm is
O(n^2 log(n)). It could be tweaked to use an insertion sort, but that
would still be O(n^2), which is not a lot better for all the effort
that would be involved. In theory (perhaps with some kind of tree) it
ought to be possible to come up with an O(n log(n)) algorithm, but
that would be a lot of work.

In the meantime, the attached variation of the patch fixes the temp
file issue and will support all 3 cases. It gives OK performance for
(1) and (2), and poor performance for (3). That could be viewed as a
future development task, which perhaps the window function API would
help with. I think it would be a shame to drop support for (2) just
because we can't do (3) efficiently yet.

Regards,
Dean

> Regards,
>
> --
> Hitoshi Harada
>

Attachment Content-Type Size
median3.diff text/x-patch 21.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2010-10-05 13:12:05 Re: Insertion of geometric type column with column[0], column[1] and etc.
Previous Message Stephen Frost 2010-10-05 13:04:55 Re: Insertion of geometric type column with column[0], column[1] and etc.

Browse pgsql-rrreviewers by date

  From Date Subject
Next Message Tom Lane 2010-10-10 21:16:59 Re: wip: functions median and percentile
Previous Message Hitoshi Harada 2010-10-05 12:14:24 Re: wip: functions median and percentile