Re: 9.5: Memory-bounded HashAgg

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.5: Memory-bounded HashAgg
Date: 2014-08-17 16:39:03
Message-ID: 53F0DAA7.3000805@fuzzy.cz
Lists: pgsql-hackers

On 10.8.2014 23:26, Jeff Davis wrote:
> This patch requires the Memory Accounting patch, or something similar
> to track memory usage.
>
> The attached patch enables hashagg to spill to disk, which means that
> hashagg will contain itself to work_mem even if the planner makes a
> bad misestimate of the cardinality.
>
> This is a well-known concept; there's even a Berkeley homework
> assignment floating around to implement it -- in postgres 7.2, no
> less. I didn't take the exact same approach as the homework assignment
> suggests, but it's not much different, either. My apologies if some
> classes are still using this as a homework assignment, but postgres
> needs to eventually have an answer to this problem.
>
> Included is a GUC, "enable_hashagg_disk" (default on), which allows
> the planner to choose hashagg even if it doesn't expect the hashtable
> to fit in memory. If it's off, and the planner misestimates the
> cardinality, hashagg will still use the disk to contain itself to
> work_mem.
>
> One situation that might surprise the user is if work_mem is set too
> low, and the user is *relying* on a misestimate to pick hashagg. With
> this patch, it would end up going to disk, which might be
> significantly slower. The solution for the user is to increase
> work_mem.
>
> Rough Design:
>
> Change the hash aggregate algorithm to accept a generic "work item",
> which consists of an input file as well as some other bookkeeping
> information.
>
> Initially prime the algorithm by adding a single work item where the
> file is NULL, indicating that it should read from the outer plan.
>
> If the memory is exhausted during execution of a work item, then
> continue to allow existing groups to be aggregated, but do not allow new
> groups to be created in the hash table. Tuples representing new groups
> are saved in an output partition file referenced in the work item that
> is currently being executed.
>
> When the work item is done, emit any groups in the hash table, clear the
> hash table, and turn each output partition file into a new work item.
>
> Each time through at least some groups are able to stay in the hash
> table, so eventually none will need to be saved in output partitions, no
> new work items will be created, and the algorithm will terminate. This
> is true even if the number of output partitions is always one.
>
> Open items:
> * costing
> * EXPLAIN details for disk usage
> * choose number of partitions intelligently
> * performance testing
>
> Initial tests indicate that it can be competitive with sort+groupagg
> when the disk is involved, but more testing is required.
>
> Feedback welcome.
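
Before getting to my comments - here is how I read the work-item flow
described above, restated as a tiny self-contained C sketch. This is
not patch code: tuples are plain ints, the "hash table" is a 4-slot
array standing in for work_mem, there's a single output partition per
work item, and all the names are invented here.

/* not patch code: a toy model of the work-item flow described above */
#include <stdio.h>

#define MAX_GROUPS 4            /* pretend work_mem fits only 4 groups */

typedef struct WorkItem
{
    FILE   *input;              /* NULL means "read from the outer plan" */
} WorkItem;

typedef struct Group
{
    int     key;
    long    count;
} Group;

/* stand-in for the outer plan: 20 tuples with keys 0..9 */
static int
fetch_outer(int *key)
{
    static int n = 0;

    if (n >= 20)
        return 0;
    *key = n++ % 10;
    return 1;
}

static void
process_item(WorkItem *item)
{
    Group   groups[MAX_GROUPS];
    int     ngroups = 0;
    FILE   *spill = NULL;       /* output partition for groups we can't fit */
    int     key;

    for (;;)
    {
        int     got, i;

        if (item->input == NULL)
            got = fetch_outer(&key);
        else
            got = (fread(&key, sizeof(key), 1, item->input) == 1);
        if (!got)
            break;

        /* existing groups keep aggregating even when "memory" is full */
        for (i = 0; i < ngroups; i++)
            if (groups[i].key == key)
                break;

        if (i < ngroups)
            groups[i].count++;
        else if (ngroups < MAX_GROUPS)
        {
            groups[ngroups].key = key;
            groups[ngroups].count = 1;
            ngroups++;
        }
        else
        {
            /* new group, no room: save the tuple in the output partition */
            if (spill == NULL)
                spill = tmpfile();
            fwrite(&key, sizeof(key), 1, spill);
        }
    }

    /* item done: emit groups, turn the partition into a new work item */
    for (int i = 0; i < ngroups; i++)
        printf("group %d: count %ld\n", groups[i].key, groups[i].count);

    if (item->input)
        fclose(item->input);

    if (spill)
    {
        WorkItem    next;

        rewind(spill);
        next.input = spill;
        process_item(&next);    /* the patch presumably queues this instead */
    }
}

int
main(void)
{
    WorkItem    initial = {NULL};   /* prime with the "read outer" item */

    process_item(&initial);
    return 0;
}

The termination argument from the last quoted paragraph shows up here:
any pass that spills had a full table, so at least MAX_GROUPS groups
are finished each round, and even with a single output partition the
input keeps shrinking until nothing is left to spill.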

I've been working on this for a few hours - getting familiar with the
code, testing queries, etc. Two comments:

1) Apparently there's something broken, because with this:

create table table_b (fk_id int, val_a int, val_b int);
insert into table_b
select i, mod(i,1000), mod(i,1000)
from generate_series(1,10000000) s(i);
analyze table_b;

I get this:

set work_mem = '8MB';
explain analyze select fk_id, count(*)
from table_b where val_a < 50 and val_b < 50 group by 1;
> The connection to the server was lost. Attempting reset: Failed.

Stacktrace attached, but apparently there's a segfault in
advance_transition_function when accessing pergroupstate.

This happened for all queries that I tried, once they needed to do
the batching.

2) Using the same hash value both for dynahash and batching seems
really fishy to me. I'm not familiar with dynahash, but I'd bet
the way it's done now will lead to bad distribution in the hash
table (some buckets will always be empty in some batches, etc.).

This is why hashjoin tries so hard to use non-overlapping parts
of the hash for batchno/bucketno.

Hashjoin implements its own hash table, which makes it clear how
the bucket is derived from the hash value. I'm not sure how dynahash
does that, but I'm pretty sure we can't just reuse the hash value
like this.

I see two options - compute our own hash value, or somehow derive
a new one (e.g. by doing "hashvalue XOR random_seed"). I'm not sure
the latter would work, though.
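
For comparison, the hashjoin approach (cf. ExecHashGetBucketAndBatch
in nodeHash.c) carves bucketno and batchno out of non-overlapping bits
of the same hash value, roughly as in the sketch below (names are
mine; it assumes nbuckets and nbatch are powers of two, as hashjoin
does):

#include <stdint.h>
#include <stdio.h>

/* low bits pick the bucket, a disjoint set of higher bits the batch */
static void
bucket_and_batch(uint32_t hashvalue,
                 uint32_t nbuckets, int log2_nbuckets,
                 uint32_t nbatch,
                 uint32_t *bucketno, uint32_t *batchno)
{
    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (nbatch > 1) ?
        ((hashvalue >> log2_nbuckets) & (nbatch - 1)) : 0;
}

int
main(void)
{
    uint32_t    bucketno, batchno;

    bucket_and_batch(0xDEADBEEF, 1024, 10, 4, &bucketno, &batchno);
    printf("bucket %u, batch %u\n",
           (unsigned) bucketno, (unsigned) batchno);
    return 0;
}

If the very same hash value is fed to dynahash and also used to pick
the partition, then within one batch the table only ever sees the
slice of hash values that maps to that partition - that's the skew
I'm worried about. Splitting the bits like this (or deriving a second,
independent hash per node) avoids it.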

regards
Tomas

Attachment Content-Type Size
gdb.log text/x-log 6.9 KB
