Re: 9.5: Memory-bounded HashAgg

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.5: Memory-bounded HashAgg
Date: 2014-08-14 18:21:14
Message-ID: 1408040474.2335.186.camel@jeff-desktop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 2014-08-14 at 12:53 -0400, Tom Lane wrote:
> Oh? So if we have aggregates like array_agg whose memory footprint
> increases over time, the patch completely fails to avoid bloat?

Yes, in its current form.

> I might think a patch with such a limitation was useful, if it weren't
> for the fact that aggregates of that nature are a large part of the
> cases where the planner misestimates the table size in the first place.
> Any complication that we add to nodeAgg should be directed towards
> dealing with cases that the planner is likely to get wrong.

In my experience, the planner has a lot of difficulty estimating the
cardinality unless it's coming from a base table without any operators
above it (other than maybe a simple predicate). This is probably a lot
more common than array_agg problems, simply because array_agg is
relatively rare compared with GROUP BY in general.

Also, there are also cases where my patch should win against Sort even
when it does go to disk. For instance, high enough cardinality to exceed
work_mem, but also a large enough group size. Sort will have to deal
with all of the tuples before it can group any of them, whereas HashAgg
can group at least some of them along the way.

Consider the skew case where the cardinality is 2M, work_mem fits 1M
groups, and the input consists of the keys 1..1999999 mixed randomly
inside one billion zeros. (Aside: if the input is non-random, you may
not get the skew value before the hash table fills up, in which case
HashAgg is just as bad as Sort.)

That being said, we can hold out for an array_agg fix if desired. As I
pointed out in another email, my proposal is compatible with the idea of
dumping groups out of the hash table, and does take some steps in that
direction.

Regards,
Jeff Davis

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Oleg Bartunov 2014-08-14 18:25:31 Re: jsonb format is pessimal for toast compression
Previous Message Andres Freund 2014-08-14 18:21:06 Re: Function to know last log write timestamp