Re: 9.5: Better memory accounting, towards memory-bounded HashAgg

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.5: Better memory accounting, towards memory-bounded HashAgg
Date: 2015-01-07 19:07:13
Message-ID: 54AD83E1.1090109@2ndquadrant.com
Lists: pgsql-hackers

Hi,

On 23.12.2014 10:16, Jeff Davis wrote:
> It seems that these two patches are being reviewed together. Should
> I just combine them into one? My understanding was that some wanted
> to review the memory accounting patch separately.
>
> On Sun, 2014-12-21 at 20:19 +0100, Tomas Vondra wrote:
>> That's the only conflict, and after fixing it it compiles OK.
>> However, I got a segfault on the very first query I tried :-(
>
> If lookup_hash_entry doesn't find the group, and there's not enough
> memory to create it, then it returns NULL; but the caller wasn't
> checking for NULL. My apologies for such a trivial mistake, I was
> doing most of my testing using DISTINCT. My fix here was done
> quickly, so I'll take a closer look later to make sure I didn't miss
> something else.
>
> New patch attached (rebased, as well).

I did a review today, using these two patches:

* memory-accounting-v9.patch (submitted on December 2)
* hashagg-disk-20141222.patch

I started with some basic performance measurements comparing hashagg
queries with and without the patches (while you compared hashagg and
sort). That's IMHO an interesting comparison, especially when no
batching is necessary - in the optimal case users should see no
slowdown at all (we shouldn't make them pay for batching unless it's
actually needed).

So I did this:

drop table if exists test_hash_agg;

create table test_hash_agg as
select
i AS a,
mod(i,1000000) AS b,
mod(i,100000) AS c,
mod(i,10000) AS d,
mod(i,1000) AS e,
mod(i,100) AS f
from generate_series(1,10000000) s(i);

vacuum (full, analyze) test_hash_agg;

i.e. a ~500MB table with 10M rows, and columns with different
cardinalities. And then queries like this:

-- 10M groups
select count(*) from (select a, count(a) from test_hash_agg
group by a) foo;

-- 10M groups (OOM)
select count(*) from (select a, array_agg(a) from test_hash_agg
group by a) foo;

-- 100 groups
select count(*) from (select f, array_agg(f) from test_hash_agg
group by f) foo;

which performed quite well, i.e. I've seen absolutely no slowdown.
Which, in the array_agg case, is quite suspicious, because on every
lookup_hash_entry call it has to do MemoryContextMemAllocated() over
10M contexts, and I really doubt that can be done in ~0 time.
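
Just to illustrate why that looks suspicious: a recursive per-call
accounting walk conceptually has to do something like the following
(this is only a sketch - the actual implementation and field names in
memory-accounting-v9.patch may differ):

    Size
    MemoryContextMemAllocated(MemoryContext context, bool recurse)
    {
        /* per-context counter maintained by the accounting patch */
        Size        total = context->mem_allocated;

        if (recurse)
        {
            MemoryContext child;

            /* visit every child context, recursively */
            for (child = context->firstchild; child != NULL;
                 child = child->nextchild)
                total += MemoryContextMemAllocated(child, true);
        }

        return total;
    }

With ~10M per-group child contexts that's a ~10M-step traversal on
every call, which clearly can't be free.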

So I started digging in the code and I noticed this:

hash_mem = MemoryContextMemAllocated(aggstate->hashcontext, true);

which is IMHO obviously wrong, because that accounts only for the
hashtable itself. It might be correct for aggregates with state passed
by value, but it's incorrect for state passed by reference (e.g.
Numeric, arrays etc.), because initialize_aggregates does this:

oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
pergroupstate->transValue = datumCopy(peraggstate->initValue,
peraggstate->transtypeByVal,
peraggstate->transtypeLen);
MemoryContextSwitchTo(oldContext);

and it's also wrong for all the user-defined aggregates that have no
access to hashcontext at all, and only get a reference to aggcontext
via AggCheckCallContext(). array_agg() is a prime example.
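
For illustration, the usual pattern in such a transition function
looks roughly like this (modeled on array_agg / accumArrayResult, not
copied from the patch):

    MemoryContext aggcontext;
    MemoryContext arr_context;

    if (!AggCheckCallContext(fcinfo, &aggcontext))
        elog(ERROR, "transition function called in non-aggregate context");

    /*
     * Per-group state lives in a child context of aggcontext - the
     * accounted hashcontext never sees any of these allocations.
     */
    arr_context = AllocSetContextCreate(aggcontext,
                                        "accumArrayResult",
                                        ALLOCSET_DEFAULT_MINSIZE,
                                        ALLOCSET_DEFAULT_INITSIZE,
                                        ALLOCSET_DEFAULT_MAXSIZE);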

In those cases the patch does no memory accounting at all, and since
hashcontext has no child contexts, there's no accounting overhead
either.
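
The fix (attached as 0003-context-fix.patch) is essentially just
pointing the accounting at the aggregate context instead, i.e.
something like

    hash_mem = MemoryContextMemAllocated(aggstate->aggcontext, true);

in both places where hash_mem is computed.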

After fixing this bug, the performance drops dramatically. For the
query with 100 groups (not requiring any batching) I see this:

test=# explain analyze select count(x) from (select f, array_agg(1) AS x
from test_hash_agg group by f) foo;

QUERY PLAN (original patch)
------------------------------------------------------------------------
Aggregate (cost=213695.57..213695.58 rows=1 width=32)
(actual time=2539.156..2539.156 rows=1 loops=1)
-> HashAggregate (cost=213693.07..213694.32 rows=100 width=4)
(actual time=2492.264..2539.012 rows=100 loops=1)
Group Key: test_hash_agg.f
Batches: 1 Memory Usage: 24kB Disk Usage: 0kB
-> Seq Scan on test_hash_agg
(cost=0.00..163693.71 rows=9999871 width=4)
(actual time=0.022..547.379 rows=10000000 loops=1)
Planning time: 0.039 ms
Execution time: 2542.932 ms
(7 rows)

QUERY PLAN (fixed patch)
------------------------------------------------------------------------
Aggregate (cost=213695.57..213695.58 rows=1 width=32)
(actual time=5670.885..5670.885 rows=1 loops=1)
-> HashAggregate (cost=213693.07..213694.32 rows=100 width=4)
(actual time=5623.254..5670.803 rows=100 loops=1)
Group Key: test_hash_agg.f
Batches: 1 Memory Usage: 117642kB Disk Usage: 0kB
-> Seq Scan on test_hash_agg
(cost=0.00..163693.71 rows=9999871 width=4)
(actual time=0.014..577.924 rows=10000000 loops=1)
Planning time: 0.103 ms
Execution time: 5677.187 ms
(7 rows)

So the performance drops 2x. With more groups, the performance impact is
even worse. For example with the first query (with 10M groups), this is
what I get in perf:

explain analyze select count(x) from (select a, array_agg(1) AS x from
test_hash_agg group by a) foo;

PerfTop: 4671 irqs/sec kernel:11.2% exact: 0.0%
------------------------------------------------------------------------

87.07% postgres [.] MemoryContextMemAllocated
1.63% [zcommon] [k] fletcher_4_native
1.60% [kernel] [k] acpi_processor_ffh_cstate_enter
0.68% [kernel] [k] xhci_irq
0.30% ld-2.19.so [.] _dl_sysinfo_int80
0.30% [kernel] [k] memmove
0.26% [kernel] [k] ia32_syscall
0.16% libglib-2.0.so.0.3800.2 [.] 0x000000000008c52a
0.15% libpthread-2.19.so [.] pthread_mutex_lock

and it runs indefinitely (I gave up after a few minutes). I believe this
renders the proposed memory accounting approach dead.

Granted, 10M groups is a bit of an extreme example, but the earlier
query with 100 groups certainly is not.

I understand the effort to avoid the 2% regression measured by Robert on
a PowerPC machine, but I don't think that's a sufficient reason to cause
so much trouble to everyone using array_agg() or user-defined aggregates
based on the same 'create subcontext' pattern.

Especially when the REINDEX case can often be improved by giving it
more maintenance_work_mem, while there's nothing you can do to improve
this case (if it's not batching, increasing work_mem does nothing).

The array_agg() patch I submitted to this CF would fix this particular
query, as it removes the child contexts (so there's no need for
recursion in MemoryContextMemAllocated), but it does nothing for the
user-defined aggregates out there. And it's not committed yet.

Also, ISTM this makes it rather unusable as a general accounting
approach. If a mere 100 subcontexts result in a 2x slowdown, then even
a handful of subcontexts will have a measurable impact if we decide to
use this somewhere else (and not just in hash aggregate).

I was curious how the accounting mechanism I proposed (a parallel
MemoryAccounting hierarchy next to MemoryContext) would handle this, so
I used it instead of memory-accounting-v9.patch. I measured no
difference compared to master (no slowdown at all).
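
To sketch the difference (the struct and function names here are only
illustrative, the actual code in 0001-memory-accounting.patch differs
in details): the accounting hierarchy is updated on the way up at
palloc/pfree time, so reading the total is a single field access, no
matter how many child contexts exist.

    typedef struct MemoryAccountingData
    {
        Size    allocated;                    /* bytes tracked in this subtree */
        struct MemoryAccountingData *parent;  /* NULL for the top-level account */
    } MemoryAccountingData;

    /* called from the allocator whenever a block is allocated or freed */
    static void
    MemoryAccountingUpdate(MemoryAccountingData *account, long delta)
    {
        for (; account != NULL; account = account->parent)
            account->allocated += delta;
    }

    /* reading the current total needs no recursion at all */
    #define MemoryAccountingGetAllocated(account) ((account)->allocated)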

I also tried array_agg() queries that actually require batching, e.g.

select a, array_agg(1) x from test_hash_agg group by a;

which produces 10M groups, each using a separate 8kB context,
amounting to ~80GB in total. With work_mem=1GB this should proceed just
fine with ~80 batches. In practice it runs indefinitely (again, I lost
patience after a few minutes), and I see this:

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 0,00 281,00 0,00 140784,00
1002,02 143,25 512,57 0,00 512,57 3,56 100,00

tomas(at)rimmer ~/tmp/pg-hashagg/base/pgsql_tmp $ du -s
374128 .
tomas(at)rimmer ~/tmp/pg-hashagg/base/pgsql_tmp $ ls -l | wc -l
130
tomas(at)rimmer ~/tmp/pg-hashagg/base/pgsql_tmp $ ls -l | head
total 372568
-rw------- 1 tomas users 2999480 7. led 19.46 pgsql_tmp23267.2637
-rw------- 1 tomas users 2973520 7. led 19.46 pgsql_tmp23267.2638
-rw------- 1 tomas users 2978800 7. led 19.46 pgsql_tmp23267.2639
-rw------- 1 tomas users 2959880 7. led 19.46 pgsql_tmp23267.2640
-rw------- 1 tomas users 3010040 7. led 19.46 pgsql_tmp23267.2641
-rw------- 1 tomas users 3083520 7. led 19.46 pgsql_tmp23267.2642
-rw------- 1 tomas users 3053160 7. led 19.46 pgsql_tmp23267.2643
-rw------- 1 tomas users 3044360 7. led 19.46 pgsql_tmp23267.2644
-rw------- 1 tomas users 3014000 7. led 19.46 pgsql_tmp23267.2645

That is, there are ~130 files, each ~3MB in size, ~370MB in total. Yet
the system does ~140MB/s of writes the whole time, and the table itself
is only ~500MB.

So either I'm missing something, or there's some sort of bug.

Attached you can find a bunch of files:

* 0001-memory-accounting.patch (my memory accounting)
* 0002-hashagg-patch.patch (jeff's patch, for completeness)
* 0003-context-fix.patch (fix hashcontext -> aggcontext)
* test.sql (data / queries I used for testing)

regards
Tomas

Attachment Content-Type Size
0001-memory-accounting.patch text/x-diff 13.0 KB
0002-hashagg-patch.patch text/x-diff 40.2 KB
0003-context-fix.patch text/x-diff 1.2 KB
test.sql application/sql 1.3 KB
