Re: Default setting for enable_hashagg_disk

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Default setting for enable_hashagg_disk
Date: 2020-07-25 18:05:02
Message-ID: CAH2-WzkhNFqSeTmzfYKNUi-6hXxH1uL=1yZLh2uEO18wWWn94g@mail.gmail.com
Lists: pgsql-docs pgsql-hackers

On Sat, Jul 25, 2020 at 10:23 AM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> There's also another effect at work that can cause the total number of
> batches to be higher for larger work_mem values: when we do recurse, we
> again need to estimate the number of partitions needed. Right now, we
> overestimate the number of partitions needed (to be conservative),
> which leads to a wider fan-out and lots of tiny partitions, and
> therefore more batches.

What worries me a bit are the sharp discontinuities that appear when
spilling with significantly less work_mem than the "optimal" amount. For
example, with Tomas' TPC-H query (against my smaller TPC-H dataset), I
find that with work_mem set to 6MB the plan looks like this:

-> HashAggregate (cost=2700529.47..3020654.22 rows=1815500 width=40) (actual time=21039.788..32278.703 rows=2000000 loops=1)
     Output: lineitem.l_partkey, (0.2 * avg(lineitem.l_quantity))
     Group Key: lineitem.l_partkey
     Planned Partitions: 128  Peak Memory Usage: 6161kB  Disk Usage: 2478080kB  HashAgg Batches: 128

(And with higher work_mem settings we get a sensible-looking number of
batches that matches the number of planned partitions, too.)
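
For reference, the plan above comes from a query along these lines --
this is only a sketch reconstructed from the Output/Group Key lines,
since the exact query text and TPC-H scale factor are from Tomas' test
case:

    SET work_mem = '6MB';
    EXPLAIN (ANALYZE, VERBOSE)
    SELECT l_partkey, 0.2 * avg(l_quantity)
    FROM lineitem
    GROUP BY l_partkey;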

However, if I set work_mem to 5MB (or less), it looks like this:

-> HashAggregate (cost=2700529.47..3020654.22 rows=1815500 width=40) (actual time=20849.490..37027.533 rows=2000000 loops=1)
     Output: lineitem.l_partkey, (0.2 * avg(lineitem.l_quantity))
     Group Key: lineitem.l_partkey
     Planned Partitions: 128  Peak Memory Usage: 5393kB  Disk Usage: 2482152kB  HashAgg Batches: 11456

So the number of partitions is still 128, but the number of batches
explodes to 11,456 all at once. My guess is that this is because the
recursive hash aggregation misbehaves in a self-similar fashion once a
certain tipping point has been reached. I expect that the exact nature
of that tipping point is very complicated, and generally dependent on
the dataset, clustering, etc. But I don't think that this kind of
effect will be uncommon.
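
Doing the arithmetic on those two numbers at least fits that guess:
11,456 batches across 128 first-level partitions works out to about 90
batches per partition, as if each partition spilled and was split all
over again. (A rough check only -- I'm assuming that HashAgg Batches
counts batches across every level of recursion.)

    -- about 89.5 batches per first-level partition at work_mem = 5MB
    SELECT round(11456.0 / 128, 1);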

(FWIW this example requires ~620MB work_mem to complete without
spilling at all -- so it's kind of extreme, though not quite as
extreme as many of the similar test results from Tomas.)

--
Peter Geoghegan
