PATCH: decreasing memory needlessly consumed by array_agg

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: PATCH: decreasing memory needlessly consumed by array_agg
Date: 2014-03-28 02:00:05
Lists: pgsql-hackers


This is a patch for an issue reported in October 2013 on pgsql-bugs:

Frank van Vugt reported that a simple query with array_agg() and a large
number of groups (1e7) fails because of OOM, even on a machine with 32GB
of RAM.

So, for example, doing this:

CREATE TABLE test (a INT, b INT);
INSERT INTO test SELECT i, i FROM generate_series(1,10000000) s(i);

SELECT a, array_agg(b) FROM test GROUP BY a;

allocates huge amounts of RAM and easily forces the machine into
swapping and eventually gets killed by OOM (on my workstation with 8GB
of RAM that happens almost immediately).

Upon investigation, it seems caused by a combination of issues:

(1) per-group memory contexts - each group state uses a dedicated
memory context, which is defined like this (in accumArrayResult):

arr_context = AllocSetContextCreate(rcontext,
                                    "accumArrayResult",
                                    ALLOCSET_DEFAULT_MINSIZE,
                                    ALLOCSET_DEFAULT_INITSIZE,
                                    ALLOCSET_DEFAULT_MAXSIZE);

which actually means this

arr_context = AllocSetContextCreate(rcontext,
                                    "accumArrayResult",
                                    0,                  /* minContextSize */
                                    8 * 1024,           /* initBlockSize = 8kB */
                                    8 * 1024 * 1024);   /* maxBlockSize = 8MB */
so each group will allocate at least 8kB of memory (on the first
palloc call). With 1e7 groups, that's ~80GB of RAM, even if each
group contains just 1 item.

(2) minimum block size in aset.c - The first idea I got was to decrease
the block size in the allocator. So I decreased it to 256B but I
was still getting OOM. Then I found that aset.c contains this:

if (initBlockSize < 1024)
    initBlockSize = 1024;

so effectively the lowest allowed block size is 1kB. Which means
~10GB of memory for the state data (i.e. not considering overhead
of the hash table etc., which is not negligible).

Considering we're talking about 1e7 32-bit integers, i.e. 40MB
of raw data, that's still pretty excessive (250x more).

While I question whether the 1kB minimum block size makes sense, I
really think the per-group memory contexts don't make much sense here.
What do they actually buy us?

The memory will be allocated when the first row of the group is
received, and won't be freed until the whole result set is
processed. At least that's how it works for Hash Aggregate.

However, that's exactly how it would work with a single memory context,
which has the significant benefit that all the groups share the same
memory (so the minimum block size is not an issue).

That is exactly what the patch does - it removes the per-group
memory contexts and reuses the main memory context of the aggregate
instead.
The patch also does one more thing - it changes how the arrays (in the
aggregate state) grow. Originally it worked like this:

/* initial size */
astate->alen = 64;

/* when full, grow exponentially */
if (astate->nelems >= astate->alen)
    astate->alen *= 2;

so the array length would grow like this: 64 -> 128 -> 256 -> 512 ...
(note we're talking about elements, not bytes, so with 32-bit
integers it's actually 256B -> 512B -> 1024B -> ...).

While I do understand the point of this (minimizing palloc overhead), I
find this pretty dangerous, especially in case of array_agg(). I've
modified the growth like this:

/* initial size */
astate->alen = 4;

/* when full, grow linearly */
if (astate->nelems >= astate->alen)
    astate->alen += 4;

I admit that might be a bit too aggressive, and maybe there's a better
way to do this, with a better balance between safety and speed. I was
thinking about something like this:

/* initial size */
astate->alen = 4;

/* when full, grow exponentially at first, then linearly */
if (astate->nelems >= astate->alen)
{
    if (astate->alen < 128)
        astate->alen *= 2;
    else
        astate->alen += 128;
}

i.e. exponential growth from a small initial size, with the growth
increment capped at 128 elements.


Attachment Content-Type Size
array-agg.patch text/x-diff 1.6 KB

