Re: Memory-Bounded Hash Aggregation

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Taylor Vesely <tvesely(at)pivotal(dot)io>, Adam Lee <ali(at)pivotal(dot)io>, Melanie Plageman <mplageman(at)pivotal(dot)io>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory-Bounded Hash Aggregation
Date: 2019-12-14 17:32:25
Message-ID: 20191214173225.3kzsbncr6brpxdll@development
Lists: pgsql-hackers

On Wed, Nov 27, 2019 at 02:58:04PM -0800, Jeff Davis wrote:
>On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote:
>> Right now the patch always initializes 32 spill partitions. Have you
>> given
>> any thought into how to intelligently pick an optimal number of
>> partitions yet?
>
>Attached a new patch that addresses this.
>
>1. Divide hash table memory used by the number of groups in the hash
>table to get the average memory used per group.
>2. Multiply by the number of groups spilled -- which I pessimistically
>estimate as the number of tuples spilled -- to get the total amount of
>memory that we'd like to have to process all spilled tuples at once.
>3. Divide the desired amount of memory by work_mem to get the number of
>partitions we'd like to have such that each partition can be processed
>in work_mem without spilling.
>4. Apply a few sanity checks, fudge factors, and limits.
>
>Using this runtime information should be substantially better than
>using estimates and projections.
>
>Additionally, I removed some branches from the common path. I think I
>still have more work to do there.
>
>I also rebased of course, and fixed a few other things.
>
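
If I'm reading that correctly, the partition sizing boils down to
something like this - just a rough sketch with made-up names, certainly
not the actual patch code:

static int
choose_spill_partitions(double hash_mem_used, double groups_in_hash,
                        double tuples_spilled, double work_mem_bytes)
{
    /* (1) average memory per group currently in the hash table */
    double  mem_per_group = hash_mem_used /
                            (groups_in_hash > 0 ? groups_in_hash : 1.0);

    /* (2) pessimistic: every spilled tuple may be a new group */
    double  mem_needed = mem_per_group * tuples_spilled;

    /* (3) partitions needed so that each one fits into work_mem */
    int     npartitions = (int) (mem_needed / work_mem_bytes) + 1;

    /* (4) sanity limits - the actual bounds are up for debate */
    if (npartitions < 4)
        npartitions = 4;
    if (npartitions > 1024)
        npartitions = 1024;

    return npartitions;
}

That seems like a sound approach to me, because it's based on the
grouping actually observed at runtime, not just on planner estimates.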

I've done a bit more testing on this, after resolving a couple of minor
conflicts due to recent commits (rebased version attached).

In particular, I've compared runs with different dataset sizes, group
sizes, GUC settings etc. The script and results from two different
machines are available here:

* https://bitbucket.org/tvondra/hashagg-tests/src/master/

The script essentially runs a simple grouping query with different
numbers of rows, groups, work_mem and parallelism settings. There's
nothing particularly magical about it.

I ran it on both master and the patched code, which allows us to compare
results and assess the impact of the patch. Overall, the changes are as
expected and either neutral or beneficial, i.e. the timings are the same
or faster.

The number of cases that regressed is fairly small, but sometimes the
regressions are annoyingly large - up to 2x in some cases. Consider this
trivial example with 100M rows:

CREATE TABLE t AS
SELECT (100000000 * random())::int AS a
FROM generate_series(1,100000000) s(i);

On master, with the default work_mem (i.e. 4MB) and

SET max_parallel_workers_per_gather = 8;

the plan looks like this:

EXPLAIN SELECT * FROM (SELECT a, count(*) FROM t GROUP BY a OFFSET 1000000000) foo;

                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Limit (cost=16037474.49..16037474.49 rows=1 width=12)
   ->  Finalize GroupAggregate (cost=2383745.73..16037474.49 rows=60001208 width=12)
         Group Key: t.a
         ->  Gather Merge (cost=2383745.73..14937462.25 rows=100000032 width=12)
               Workers Planned: 8
               ->  Partial GroupAggregate (cost=2382745.59..2601495.66 rows=12500004 width=12)
                     Group Key: t.a
                     ->  Sort (cost=2382745.59..2413995.60 rows=12500004 width=4)
                           Sort Key: t.a
                           ->  Parallel Seq Scan on t (cost=0.00..567478.04 rows=12500004 width=4)
(10 rows)

Which kinda makes sense - we can't do hash aggregate, because there are
100M distinct values, and that won't fit into 4MB of memory (and the
planner knows about that).
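
(Back-of-the-envelope: even at a few dozen bytes per group, tens of
millions of groups means gigabytes of hash table - roughly three orders
of magnitude more than the 4MB limit.)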

And it completes in about 108381 ms, give or take. With the patch, the
plan changes to this:

EXPLAIN SELECT * FROM (SELECT a, count(*) FROM t GROUP BY a OFFSET 1000000000) foo;

                                 QUERY PLAN
---------------------------------------------------------------------------
 Limit (cost=2371037.74..2371037.74 rows=1 width=12)
   ->  HashAggregate (cost=1942478.48..2371037.74 rows=42855926 width=12)
         Group Key: t.a
         ->  Seq Scan on t (cost=0.00..1442478.32 rows=100000032 width=4)
(4 rows)

i.e. it's way cheaper than the master plan and not parallel, but when
executed it takes much longer (about 147442 ms). After forcing a
parallel query (by setting parallel_setup_cost = 0) the plan changes to
a parallel one without a partial aggregate, but that's even slower.

The explain analyze for the non-parallel plan looks like this:

                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Limit (cost=2371037.74..2371037.74 rows=1 width=12) (actual time=160180.718..160180.718 rows=0 loops=1)
   ->  HashAggregate (cost=1942478.48..2371037.74 rows=42855926 width=12) (actual time=54462.728..157594.756 rows=63215980 loops=1)
         Group Key: t.a
         Memory Usage: 4096kB Batches: 8320 Disk Usage:4529172kB
         ->  Seq Scan on t (cost=0.00..1442478.32 rows=100000032 width=4) (actual time=0.014..12198.044 rows=100000000 loops=1)
 Planning Time: 0.110 ms
 Execution Time: 160183.517 ms
(7 rows)

So the cost is about 7x lower than on master, but the duration is much
higher. I don't know how much of this is preventable, but it seems there
might be something missing in the costing: when I set work_mem to 1TB on
master, and tweak the n_distinct estimate for the column to be exactly
the same on the two clusters, I get this:

master:
-------

SET work_mem = '1TB';
EXPLAIN SELECT * FROM (SELECT a, count(*) FROM t GROUP BY a OFFSET 1000000000) foo;

                                 QUERY PLAN
---------------------------------------------------------------------------
 Limit (cost=2574638.28..2574638.28 rows=1 width=12)
   ->  HashAggregate (cost=1942478.48..2574638.28 rows=63215980 width=12)
         Group Key: t.a
         ->  Seq Scan on t (cost=0.00..1442478.32 rows=100000032 width=4)
(4 rows)

patched:
--------

EXPLAIN SELECT * FROM (SELECT a, count(*) FROM t GROUP BY a OFFSET 1000000000) foo;

                                 QUERY PLAN
---------------------------------------------------------------------------
 Limit (cost=2574638.28..2574638.28 rows=1 width=12)
   ->  HashAggregate (cost=1942478.48..2574638.28 rows=63215980 width=12)
         Group Key: t.a
         ->  Seq Scan on t (cost=0.00..1442478.32 rows=100000032 width=4)
(4 rows)

That is, the cost is exactly the same, even though in the second case we
expect to do quite a bit of batching - 8320 batches, as shown in the
explain analyze above (and we know batching has to happen, because on
master we wouldn't even use hash aggregate without the work_mem tweak).

So I think we're not costing the batching properly / at all.
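
Presumably we need to charge something for the I/O caused by spilling,
similar to how cost_sort() accounts for the I/O of external sorts. Very
roughly something like this (made-up variable names, and ignoring the
seq/random page cost distinction):

    /* bytes we expect to write to disk and read back */
    double  spill_bytes = tuples_spilled * tuple_width;
    double  spill_pages = spill_bytes / BLCKSZ;

    /* each spilled tuple gets written once and read back once */
    run_cost += 2.0 * seq_page_cost * spill_pages;

The tricky part, I guess, is estimating at plan time how many tuples
will actually spill.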

A couple more comments:

1) IMHO we should rename hashagg_mem_overflow to enable_hashagg_overflow
or something like that. I think that describes the GUC purpose better
(and it's more consistent with enable_hashagg_spill).

2) show_hashagg_info

I think there's a missing space after ":" here:

" Batches: %d Disk Usage:%ldkB",

and maybe we should use just "Disk:" like we do for sort:

   ->  Sort (actual time=662.136..911.558 rows=1000000 loops=1)
         Sort Key: t2.a
         Sort Method: external merge Disk: 13800kB
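
i.e. the format string might end up as something like this (just a
suggestion, the exact spacing doesn't matter much):

" Batches: %d  Disk: %ldkB"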

3) I'm not quite sure what to think about the JIT recompile we do for
EEOP_AGG_INIT_TRANS_SPILLED etc. I'm no llvm/jit expert, but do we do
that for some other existing cases?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
hashagg-20191210.diff text/plain 87.4 KB
