Re: Memory-Bounded Hash Aggregation

From: Adam Lee <ali(at)pivotal(dot)io>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Taylor Vesely <tvesely(at)pivotal(dot)io>, Melanie Plageman <mplageman(at)pivotal(dot)io>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory-Bounded Hash Aggregation
Date: 2019-12-20 09:16:26
Message-ID: 20191220091626.GD45905@mars.local
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Dec 14, 2019 at 06:32:25PM +0100, Tomas Vondra wrote:
> I've done a bit more testing on this, after resolving a couple of minor
> conflicts due to recent commits (rebased version attached).
>
> In particular, I've made a comparison with different dataset sizes,
> group sizes, GUC settings etc. The script and results from two different
> machines are available here:
>
> The script essentially runs a simple grouping query with different
> number of rows, groups, work_mem and parallelism settings. There's
> nothing particularly magical about it.

Nice!

> I did run it both on master and patched code, allowing us to compare
> results and assess impact of the patch. Overall, the changes are
> expected and either neutral or beneficial, i.e. the timing are the same
> or faster.
>
> The number of cases that regressed is fairly small, but sometimes the
> regressions are annoyingly large - up to 2x in some cases. Consider for
> example this trivial example with 100M rows:

I suppose this is because the patch has no costing changes yet. I hacked
a little to give hash agg a spilling punish, just some value based on
(groups_in_hashtable * num_of_input_tuples)/num_groups_from_planner, it
would not choose hash aggregate in this case.

However, that punish is wrong, because comparing to the external sort
algorithm, hash aggregate has the respilling, which involves even more
I/O, especially with a very large number of groups but a very small
number of tuples in a single group like the test you did. It would be a
challenge.

BTW, Jeff, Greenplum has a test for hash agg spill, I modified a little
to check how many batches a query uses, it's attached, not sure if it
would help.

--
Adam Lee

Attachment Content-Type Size
hashagg_spill.sql text/plain 2.8 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Jehan-Guillaume de Rorthais 2019-12-20 09:55:25 Re: How is this possible "publication does not exist"
Previous Message Mark Lorenz 2019-12-20 08:38:16 Re: Created feature for to_date() conversion using patterns 'YYYY-WW', 'YYYY-WW-D', 'YYYY-MM-W' and 'YYYY-MM-W-D'