Re: wip: functions median and percentile

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, David Fetter <david(at)fetter(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: wip: functions median and percentile
Date: 2010-10-11 14:03:40
Message-ID: 22629.1286805820@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-rrreviewers

Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> writes:
> On 10 October 2010 22:16, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> BTW, as far as the implementation issues go, telling tuplesort that it
>> can use gigabytes of memory no matter what seems quite unacceptable.
>> Put this thing into a hash aggregation and you'll blow out your memory
>> in no time. I don't think it's even a good idea to use work_mem there.

> Argh! Yes that sounds like a much more serious problem.

> Interestingly I couldn't seem to produce this effect. Every effort I
> make to write a query to test this with median ends up being executed
> using a GroupAggregate, while the equivalent query with avg uses a
> HashAggregate. I don't understand why they are being treated
> differently.

If you're using recent sources, there's some code in count_agg_clauses()
that assumes that an aggregate with transtype INTERNAL will use
ALLOCSET_DEFAULT_INITSIZE (ie 8K) workspace. So that'll discourage the
planner from selecting HashAggregate except for a pretty small number of
groups. The problem is that there's still a whole lot of daylight
between 8K and 2G, so plenty of room to go wrong.

The other approach that we could take here is to replace the
ALLOCSET_DEFAULT_INITSIZE hack (which is certainly no more than a hack)
with some way for an aggregate to declare how much space it'll eat,
or more simply to mark it as "never use in HashAgg". This was discussed
earlier but it would require a significant amount of dogwork and no one
was real excited about doing it. Maybe it's time to bite that bullet
though. Reflecting on it, I think it'd be best to allow an agg to
provide an estimation function that'd be told the input data type and
expected number of rows --- even on a per-aggregate basis, a constant
estimate just isn't good enough.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2010-10-11 14:08:53 Re: wip: functions median and percentile
Previous Message Josh Kupershmidt 2010-10-11 13:41:35 Re: column-level update privs + lock table

Browse pgsql-rrreviewers by date

  From Date Subject
Next Message Tom Lane 2010-10-11 14:08:53 Re: wip: functions median and percentile
Previous Message Robert Haas 2010-10-11 10:33:12 Re: wip: functions median and percentile