From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Taylor Vesely <tvesely(at)pivotal(dot)io>, Adam Lee <ali(at)pivotal(dot)io>, Melanie Plageman <mplageman(at)pivotal(dot)io>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Memory-Bounded Hash Aggregation
Date: 2020-02-19 19:16:36
Message-ID: 20200219191636.gvdywx32kwbix6kd@development
Lists: pgsql-hackers

Hi,

I've started reviewing the 20200218 version of the patch. In general it
seems fine, but I have a couple minor comments and two crashes.

1) explain.c currently shows the hashagg spill details only when ANALYZE is used.

I wonder if we could show something for plain explain (without analyze).
At least the initial estimate of partitions, etc. I know not showing
those details until after execution is what e.g. sort does, but I find
it a bit annoying.

A related comment: maybe this should also report the initial number of
partitions, not just the total number. With just the total it's
impossible to tell whether any repartitioning happened.
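
For illustration, plain EXPLAIN could print the planner estimate with
the usual explain.c helper; this is just a sketch, and
"hash_planned_partitions" is a name I made up, not necessarily what the
patch uses:

    /*
     * Sketch only: show the planned number of spill partitions even
     * without ANALYZE.  "hash_planned_partitions" is a hypothetical
     * field; ExplainPropertyInteger() is the standard helper.
     */
    if (es->costs && aggstate->hash_planned_partitions > 0)
        ExplainPropertyInteger("Planned Partitions", NULL,
                               aggstate->hash_planned_partitions, es);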

2) The ExecBuildAggTrans comment should probably explain "spilled".

3) I wonder if we need to invent new opcodes? Wouldn't it be simpler to
just add a new flag to the agg_* structs instead? I haven't tried hacking
this, so maybe it's a silly idea.
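
Something like this is what I have in mind (purely a sketch, with all
struct and field names invented rather than taken from the patch):

    #include <stdbool.h>

    /*
     * Hypothetical alternative to adding EEOP_AGG_*_SPILLED opcodes:
     * keep the existing transition steps and branch on a flag at
     * runtime.  Everything here is made up for illustration.
     */
    typedef struct AggTransStepSketch
    {
        void   *pertrans;       /* stands in for AggStatePerTrans */
        int     setoff;         /* grouping-set offset */
        bool    check_spilled;  /* if set, skip the transition when the
                                 * tuple was routed to a spill file */
    } AggTransStepSketch;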

4) lookup_hash_entries says

/* check to see if we need to spill the tuple for this grouping set */

But that seems bogus, because AFAIK we can't spill tuples for grouping
sets. So maybe this should say just "grouping"?

5) Assert(nbuckets > 0);

I was curious what happens in case of extreme skew, when many or all rows
consistently fall into a single partition. So I did this:

create table t (a int, b real);

insert into t select i, random()
from generate_series(-2000000000, 2000000000) s(i)
where mod(hashint4(i), 16384) = 0;

analyze t;
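
As a quick sanity check of the skew (the INSERT's WHERE clause
guarantees this):

    -- all surviving rows satisfy mod(hashint4(a), 16384) = 0, so they
    -- share the same low-order hash bits and land in a single partition
    select count(distinct mod(hashint4(a), 16384)) from t;  -- returns 1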

set work_mem = '64kB';
set max_parallel_workers_per_gather = 0;
set enable_sort = 0;

explain select a, sum(b) from t group by a;

                          QUERY PLAN
---------------------------------------------------------------
 HashAggregate  (cost=23864.26..31088.52 rows=244631 width=8)
   Group Key: a
   ->  Seq Scan on t  (cost=0.00..3529.31 rows=244631 width=8)
(3 rows)

This, however, quickly fails on this assert in BuildTupleHashTableExt
(see backtrace1.txt):

Assert(nbuckets > 0);

The value is computed in hash_choose_num_buckets, and there seem to be
no protections against returning bogus values like 0. So maybe we should
return

Max(nbuckets, 1024)

or something like that, similarly to hash join. OTOH maybe it's simply
due to agg_refill_hash_table() passing bogus values to the function?
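
Either way, a defensive clamp seems cheap. Concretely, it might look
like this (just a sketch; the actual variable names in
hash_choose_num_buckets may differ):

    /*
     * Sketch: enforce a floor on the bucket count, mirroring the
     * Max(nbuckets, 1024) clamp ExecChooseHashTableSize() applies for
     * hash joins, so we never hand 0 to BuildTupleHashTableExt().
     */
    nbuckets = Max(nbuckets, 1024);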

6) Another thing that occurred to me was what happens to grouping sets,
which we can't spill to disk. So I did this:

create table t2 (a int, b int, c int);

-- run repeatedly, until there are about 20M rows in t2 (1GB)
with tx as (select array_agg(a) as a, array_agg(b) as b
from (select a, b from t order by random()) foo),
ty as (select array_agg(a) AS a
from (select a from t order by random()) foo)
insert into t2 select unnest(tx.a), unnest(ty.a), unnest(tx.b)
from tx, ty;
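
To know when to stop repeating the INSERT, a check like this will do
(pg_size_pretty and pg_total_relation_size are standard functions):

    -- check progress after each run; stop at roughly 20M rows / ~1GB
    select count(*) as n_rows,
           pg_size_pretty(pg_total_relation_size('t2')) as total_size
    from t2;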

analyze t2;

This produces a table with two independent columns, skewed the same way
as column t.a. I don't know how much of this actually matters, given
that grouping sets don't spill; maybe the independence is sufficient
and the skew is irrelevant?
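
If in doubt, the independence is easy to eyeball with the stock corr()
aggregate, which should come out close to 0 here:

    -- near-zero correlation suggests the shuffled columns are independent
    select corr(a, b) from t2;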

And then do this:

set work_mem = '200MB';
set max_parallel_workers_per_gather = 0;
set enable_sort = 0;

explain select a, b, sum(c) from t2 group by cube (a,b);

                             QUERY PLAN
---------------------------------------------------------------------
 MixedAggregate  (cost=0.00..833064.27 rows=2756495 width=16)
   Hash Key: a, b
   Hash Key: a
   Hash Key: b
   Group Key: ()
   ->  Seq Scan on t2  (cost=0.00..350484.44 rows=22750744 width=12)
(6 rows)

which fails with a segfault at execution time:

tuplehash_start_iterate (tb=0x18, iter=iter(at)entry=0x2349340)
870 for (i = 0; i < tb->size; i++)
(gdb) bt
#0 tuplehash_start_iterate (tb=0x18, iter=iter(at)entry=0x2349340)
#1 0x0000000000654e49 in agg_retrieve_hash_table_in_memory ...

That's not surprising, because the 0x18 pointer is obviously bogus. I
guess this is simply an offset (0x18 = 24 bytes) added to a NULL pointer?
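
That pattern is easy to demonstrate outside PostgreSQL; here is a
standalone illustration (the struct layout is made up, chosen only so
that a member lands at byte offset 24):

    #include <stdio.h>
    #include <stddef.h>

    /* Dereferencing a member through a NULL struct pointer faults at
     * the member's offset, which is why the fault address is a small
     * value like 0x18 (= 24) rather than 0. */
    struct fake_hash
    {
        long    a, b, c;    /* 3 x 8 bytes on LP64 */
        long    size;       /* offset 24 = 0x18 */
    };

    int main(void)
    {
        printf("0x%zx\n", offsetof(struct fake_hash, size)); /* prints 0x18 */
        return 0;
    }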

Disabling hashagg spill (setting both GUCs to off) makes no difference,
but on master it fails like this:

ERROR: out of memory
DETAIL: Failed on request of size 3221225472 in memory context "ExecutorState".

which is annoying, but expected given the underestimate and hashagg, and
much better than just crashing the whole cluster.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
backtrace1.txt text/plain 4.3 KB
backtrace2.txt text/plain 3.7 KB
