Re: Default setting for enable_hashagg_disk

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Default setting for enable_hashagg_disk
Date: 2020-06-10 01:20:13
Message-ID: CAAKRu_aisiENMsPM2gC4oUY1hHG3yrCwY-fXUg22C6_MJUwQdA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-docs pgsql-hackers

On Thu, Apr 9, 2020 at 1:02 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> 2. enable_groupingsets_hash_disk (default false):
>
> This is about how we choose which grouping sets to hash and which to
> sort when generating mixed mode paths.
>
> Even before this patch, there are quite a few paths that could be
> generated. It tries to estimate the size of each grouping set's hash
> table, and then see how many it can fit in work_mem (knapsack), while
> also taking advantage of any path keys, etc.
>
> With Disk-based Hash Aggregation, in principle we can generate paths
> representing any combination of hashing and sorting for the grouping
> sets. But that would be overkill (and grow to a huge number of paths if
> we have more than a handful of grouping sets). So I think the existing
> planner logic for grouping sets is fine for now. We might come up with
> a better approach later.
>
> But that created a testing problem, because if the planner estimates
> correctly, no hashed grouping sets will spill, and the spilling code
> won't be exercised. This GUC makes the planner disregard which grouping
> sets' hash tables will fit, making it much easier to exercise the
> spilling code. Is there a better way I should be testing this code
> path?
>
>
So, I was catching up on email and noticed the last email in this
thread.

I think I am not fully understanding what enable_groupingsets_hash_disk
does. Is it only for testing?

Using the tests you added to src/test/regress/sql/groupingsets.sql, I
did get a plan that looks like hashagg is spilling to disk (goes through
hashagg_spill_tuple() code path and has number of batches reported in
Explain) in a MixedAgg plan for a grouping sets query even with
enable_groupingsets_hash_disk set to false. You don't have the exact
query I tried (below) in the test suite, but it is basically what is
already there, so I must be missing something.

set enable_hashagg_disk = true;
SET enable_groupingsets_hash_disk = false;
SET work_mem='64kB';
set enable_hashagg = true;
set jit_above_cost = 0;
drop table if exists gs_hash_1;
create table gs_hash_1 as
select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
(select g%1000 as g1000, g%100 as g100, g%10 as g10, g
from generate_series(0,199999) g) s
group by cube (g1000,g100,g10);

explain (analyze, costs off, timing off)
select g1000, g100, g10
from gs_hash_1 group by cube (g1000,g100,g10);

QUERY PLAN
--------------------------------------------------------------
MixedAggregate (actual rows=9648 loops=1)
Hash Key: g10
Hash Key: g10, g1000
Hash Key: g100
Hash Key: g100, g10
Group Key: g1000, g100, g10
Group Key: g1000, g100
Group Key: g1000
Group Key: ()
Peak Memory Usage: 233 kB
Disk Usage: 1600 kB
HashAgg Batches: 2333
-> Sort (actual rows=4211 loops=1)
Sort Key: g1000, g100, g10
Sort Method: external merge Disk: 384kB
-> Seq Scan on gs_hash_1 (actual rows=4211 loops=1)

Anyway, when I throw in the stats trick that is used in join_hash.sql:

alter table gs_hash_1 set (autovacuum_enabled = 'false');
update pg_class set reltuples = 10 where relname = 'gs_hash_1';

I get a MixedAgg plan that doesn't have any Sort below and uses much
more disk.

QUERY PLAN
----------------------------------------------------------
MixedAggregate (actual rows=4211 loops=1)
Hash Key: g1000, g100, g10
Hash Key: g1000, g100
Hash Key: g1000
Hash Key: g100, g10
Hash Key: g100
Hash Key: g10, g1000
Hash Key: g10
Group Key: ()
Peak Memory Usage: 405 kB
Disk Usage: 59712 kB
HashAgg Batches: 4209
-> Seq Scan on gs_hash_1 (actual rows=200000 loops=1)

I'm not sure if this is more what you were looking for--or maybe I am
misunderstanding the guc.

--
Melanie Plageman

In response to

Responses

Browse pgsql-docs by date

  From Date Subject
Next Message Justin Pryzby 2020-06-10 02:15:44 Re: Default setting for enable_hashagg_disk
Previous Message PG Doc comments form 2020-06-09 21:35:41 Signed-ness of ints is unclear in FE-BE protocol docs

Browse pgsql-hackers by date

  From Date Subject
Next Message torikoshia 2020-06-10 01:50:58 Re: Is it useful to record whether plans are generic or custom?
Previous Message Andres Freund 2020-06-10 00:12:54 Re: elog(DEBUG2 in SpinLocked section.