Memory-Bounded Hash Aggregation

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Memory-Bounded Hash Aggregation
Date: 2019-07-01 19:13:53
Message-ID: 507ac540ec7c20136364b5272acbcd4574aa76ef.camel@j-davis.com
Lists: pgsql-hackers

This is for design review. I have a patch (WIP) for Approach 1, and if
this discussion starts to converge on that approach I will polish and
post it.

Let's start at the beginning: why do we have two strategies -- hash
and sort -- for aggregating data? The two are more similar than they
first appear. A partitioned hash strategy writes randomly among the
partitions, and later reads the partitions sequentially; a sort will
write sorted runs sequentially, but then read randomly among the
runs during the merge phase. A hash is a convenient small
representation of the data that is cheaper to operate on; sort uses
abbreviated keys for the same reason.

Hash offers:

* Data is aggregated on-the-fly, effectively "compressing" the amount
of data that needs to go to disk. This is particularly important
when the data contains skewed groups (see below).

* Can output some groups after the first pass of the input data even
if other groups spilled.

* Some data types only support hashing, not sorting.

Sort+Group offers:

* Only one group's transition state accumulates at a time, so if the
state grows (like with ARRAY_AGG), the memory needed is minimized.

* The input may already happen to be sorted.

* Some data types only support sorting, not hashing.

Currently, Hash Aggregation is only chosen if the optimizer believes
that all the groups (and their transition states) fit in
memory. Unfortunately, if the optimizer is wrong (often the case if the
input is not a base table), the hash table will
keep growing beyond work_mem, potentially bringing the entire system
to OOM. This patch fixes that problem by extending the Hash
Aggregation strategy to spill to disk when needed.

Previous discussions:

https://www.postgresql.org/message-id/1407706010.6623.16.camel@jeff-desktop

https://www.postgresql.org/message-id/1419326161.24895.13.camel%40jeff-desktop

https://www.postgresql.org/message-id/87be3bd5-6b13-d76e-5618-6db0a4db584d%40iki.fi

A lot was discussed, which I will try to summarize and address here.

Digression: Skewed Groups:

Imagine the input tuples have the following grouping keys:

0, 1, 0, 2, 0, 3, 0, 4, ..., 0, N-1, 0, N

Group 0 is a skew group because it accounts for 50% of all tuples in
the table, whereas every other group has a single tuple. If the
algorithm is able to keep group 0 in memory the whole time until
finalized, that means that it doesn't have to spill any group-0
tuples. In this example, that would amount to a 50% savings, and is a
major advantage of Hash Aggregation versus Sort+Group.

High-level approaches:

1. When the in-memory hash table fills, keep existing entries in the
hash table, and spill the raw tuples for all new groups in a
partitioned fashion. When all input tuples are read, finalize groups
in memory and emit. Now that the in-memory hash table is cleared (and
memory context reset), process a spill file the same as the original
input, but this time with a fraction of the group cardinality. (Rough
sketches of Approaches 1 and 2 follow after this list.)

2. When the in-memory hash table fills, partition the hash space, and
evict the groups from all partitions except one by writing out their
partial aggregate states to disk. Any input tuples belonging to an
evicted partition get spilled to disk. When the input is read
entirely, finalize the groups remaining in memory and emit. Now that
the in-memory hash table is cleared, process the next partition by
loading its partial states into the hash table, and then processing
its spilled tuples.

3. Use some kind of hybrid[1][2] of hashing and sorting.
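
To make the first two approaches more concrete, here are rough
sketches in pseudocode-flavored C. Every helper here (hashtable_lookup,
spillset_write, partition_for, and so on) is a hypothetical name used
only for illustration; none of this is the actual executor code.

Approach 1: new groups that arrive after the hash table is full are
spilled as raw tuples and reprocessed recursively:

static void
hashagg_approach1(TupleSource *input, int depth)
{
    HashTable  *ht = hashtable_create();
    SpillSet   *spill = NULL;
    Tuple      *tup;

    while ((tup = next_tuple(input)) != NULL)
    {
        uint32      hash = hash_group_key(tup);
        Group      *grp = hashtable_lookup(ht, tup, hash);

        if (grp != NULL)
            advance_transition(grp, tup);   /* group already in memory */
        else if (hashtable_memory(ht) < work_mem)
            advance_transition(hashtable_insert(ht, tup, hash), tup);
        else
        {
            /* table is full and this is a new group: spill the raw tuple */
            if (spill == NULL)
                spill = spillset_create(NUM_PARTITIONS);
            spillset_write(spill, partition_for(hash, depth), tup);
        }
    }

    finalize_and_emit(ht);      /* emit the groups that stayed in memory */
    hashtable_reset(ht);        /* memory context reset */

    /*
     * Each spill partition is processed just like the original input,
     * but with roughly 1/NUM_PARTITIONS of the group cardinality.
     */
    if (spill != NULL)
    {
        for (int i = 0; i < NUM_PARTITIONS; i++)
            hashagg_approach1(spillset_reader(spill, i), depth + 1);
    }
}

Approach 2: the hash space is partitioned up-front; on overflow, every
partition except one is evicted by writing out its partial transition
states, and later tuples for evicted partitions are spilled without a
hash lookup:

while ((tup = next_tuple(input)) != NULL)
{
    uint32      hash = hash_group_key(tup);
    int         part = partition_for(hash, 0);

    if (evicted[part])
    {
        spillset_write(spill, part, tup);   /* no hash lookup needed */
        continue;
    }

    advance_or_insert(ht, tup, hash);

    if (hashtable_memory(ht) >= work_mem)
    {
        for (int p = 0; p < NUM_PARTITIONS; p++)
        {
            if (p == resident_partition)
                continue;
            write_partial_states(spill, p, ht);     /* serialized states */
            remove_partition(ht, p);
            evicted[p] = true;
        }
    }
}

finalize_and_emit(ht);

/*
 * Then, for each evicted partition: load its partial states back into
 * the hash table, replay its spilled tuples, finalize and emit.
 */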

Evaluation of approaches:

Approach 1 is a nice incremental improvement on today's code. The
final patch may be around 1KLOC. It's a single kind of on-disk data
(spilled tuples), and a single algorithm (hashing). It also handles
skewed groups well because the skewed groups are likely to be
encountered before the hash table fills up the first time, and
therefore will stay in memory.

Approach 2 is nice because it resembles the approach of Hash Join, and
it can determine whether a tuple should be spilled without a hash
lookup. Unfortunately, those upsides are fairly mild, and it has
significant downsides:

* It doesn't handle skewed groups well because it's likely to evict
them.

* If we leave part of the hash table in memory, it's difficult to
ensure that we will be able to actually use the space freed by
eviction, because the freed memory may be fragmented. That could
force us to evict the entire in-memory hash table as soon as we
partition, reducing a lot of the benefit of hashing.

* It requires eviction for the algorithm to work. That may be
necessary for handling cases like ARRAY_AGG (see below) anyway, but
this approach constrains the specifics of eviction.

Approach 3 is interesting because it unifies the two approaches and
can get some of the benefits of both. It's only a single path, so it
avoids planner mistakes. I really like this idea and it's possible we
will end up with approach 3. However:

* It requires that all data types support sorting, or that we punt
somehow.

* Right now we are in a weird state because hash aggregation cheats,
so it's difficult to evaluate whether Approach 3 is moving us in the
right direction because we have no other correct implementation to
compare against. Even if Approach 3 is where we end up, it seems
like we should fix hash aggregation as a stepping stone first.

* It means we have a hash table and sort running concurrently, each
using memory. Andres said this might not be a problem[3], but I'm
not convinced that the problem is zero. If you use small work_mem
for the write phase of sorting, you'll end up with a lot of runs to
merge later, and merging them is not free.

* The simplicity might start to evaporate when we consider grouping
sets and eviction strategy.

Main topics to consider:

ARRAY_AGG:

Some aggregates, like ARRAY_AGG, have a transition state that grows
proportionally with the group size. In other words, it is not a
summary like COUNT or AVG, it contains all of the input data in a new
form.

These aggregates are not good candidates for hash aggregation. Hash
aggregation is about keeping many transition states running in
parallel, which is just a bad fit for large transition states. Sorting
is better because it advances one transition state at a time. We could:

* Let ARRAY_AGG continue to exceed work_mem like today.

* Block or pessimize use of hash aggregation for such aggregates.

* Evict groups from the hash table when it becomes too large. This
requires the ability to serialize and deserialize transition states,
and some approaches here might also need combine_func
specified. These requirements seem reasonable, but we still need an
answer for aggregates that grow like ARRAY_AGG
but don't have the required serialfunc, deserialfunc, or
combine_func.
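
To illustrate the eviction option, here is a rough sketch (hypothetical
names again, not actual executor code) of what serializing and evicting
large transition states might look like:

/* called when the hash table grows past work_mem */
foreach_group(ht, grp)
{
    /* requires the aggregate to have a serialfunc */
    bytea      *state = call_serialfunc(grp->peragg, grp->transvalue);

    spillset_write_state(spill, partition_for(grp->hash, 0), state);
    remove_group(ht, grp);
}

/*
 * When a partition is later processed, the saved states are read back
 * with the deserialfunc; if the same group was evicted more than once,
 * its states are merged with the combine_func.
 */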

GROUPING SETS:

With grouping sets, there are multiple hash tables and each hash table
has its own hash function, so that makes partitioning more
complex. In Approach 1, that means we need to either (a) not partition
the spilled tuples; or (b) have a different set of partitions for each
hash table and spill the same tuple multiple times. In Approach 2, we
would be required to partition each hash table separately and spill
tuples multiple times. In Approach 3 (depending on the exact approach
but taking a guess here) we would need to add a set of phases (one
extra phase for each hash table) for spilled tuples.
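
As an example, option (b) under Approach 1 might look roughly like the
following (hypothetical names once more): each hash table keeps its own
set of spill partitions, so a single input tuple may be written to disk
once for every hash table whose lookup misses after memory is full:

for (int i = 0; i < num_hashtables; i++)
{
    uint32      hash = hash_group_key_for(i, tup);  /* per-table hash function */
    Group      *grp = hashtable_lookup(ht[i], tup, hash);

    if (grp != NULL)
        advance_transition(grp, tup);
    else if (total_hash_memory() < work_mem)
        advance_transition(hashtable_insert(ht[i], tup, hash), tup);
    else
        spillset_write(spill[i], partition_for(hash, 0), tup);
}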

MEMORY TRACKING:

I have a patch to track the total allocated memory by
incrementing/decrementing it when blocks are malloc'd/free'd. This
doesn't do bookkeeping for each chunk, only each block. Previously,
Robert Haas raised some concerns[4] about performance, which were
mitigated[5] but perhaps not entirely eliminated (the remaining
regressions became hard to reproduce).
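
Reduced to a minimal, self-contained illustration (this is the shape of
the idea, not the patch itself): the accounting is adjusted only in the
block allocation and free paths, never per chunk, so the hot
palloc/pfree paths are untouched.

#include <stdlib.h>

typedef struct MemoryAccounting
{
    size_t      total_allocated;    /* sum of the sizes of all live blocks */
} MemoryAccounting;

static void *
block_alloc(MemoryAccounting *acct, size_t size)
{
    void       *block = malloc(size);

    if (block != NULL)
        acct->total_allocated += size;      /* one increment per block */
    return block;
}

static void
block_free(MemoryAccounting *acct, void *block, size_t size)
{
    acct->total_allocated -= size;          /* one decrement per block */
    free(block);
}

The executor can then compare total_allocated against work_mem to
decide when to start spilling, without walking individual chunks.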

The only alternative is estimation, which is ugly and seems like a bad
idea. Memory usage isn't just driven by inputs, it's also driven by
patterns of use. Misestimates in the planner are fine (within reason)
because we don't have any other choice, and a small-factor misestimate
might not change the plan anyway. But in the executor, a small-factor
misestimate seems like it's just not doing the job. If a user found
that hash aggregation was using 3X work_mem, and my only explanation
is "well, it's just an estimate", I would be pretty embarrassed and
the user would likely lose confidence in the feature. I don't mean
that we must track memory perfectly everywhere, but using an estimate
seems like a mediocre improvement over the current state.

We should proceed with memory context tracking and try to eliminate or
mitigate performance concerns. I would not like to make any herculean
effort as a part of the hash aggregation work though; I think it's
basically just something a memory manager in a database system should
have supported all along. I think we will find other uses for it as
time goes on. We have more and more things happening in the executor
and having a cheap way to check "how much memory is this thing using?"
seems very likely to be useful.

Other points:

* Someone brought up the idea of using logtapes.c instead of writing
separate files for each partition. That seems reasonable, but it's
using logtapes.c slightly outside of its intended purpose. Also,
it's awkward to need to specify the number of tapes up-front. Worth
experimenting with to see if it's a win.

* Tomas did some experiments regarding the number of batches to choose
and how to choose them. It seems like there's room for improvement
over the simple calculation I'm doing now.

* There was a lot of discussion about a smart eviction strategy. I don't see
strong evidence that it's worth the complexity at this time. The
smarter we try to be, the more bookkeeping and memory fragmentation
problems we will have. If we evict something, we should probably
evict the whole hash table or some large part of it.

Regards,
Jeff Davis

[1]
https://postgr.es/m/20180604185205.epue25jzpavokupf%40alap3.anarazel.de
[2]
https://postgr.es/m/message-id/CAGTBQpa__-NP7%3DkKwze_enkqw18vodRxKkOmNhxAPzqkruc-8g%40mail.gmail.com
[3]
https://www.postgresql.org/message-id/20180605175209.vavuqe4idovcpeie%40alap3.anarazel.de
[4]
https://www.postgresql.org/message-id/CA%2BTgmobnu7XEn1gRdXnFo37P79bF%3DqLt46%3D37ajP3Cro9dBRaA%40mail.gmail.com
[5]
https://www.postgresql.org/message-id/1413422787.18615.18.camel%40jeff-desktop
