Re: Default setting for enable_hashagg_disk

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Default setting for enable_hashagg_disk
Date: 2020-07-25 17:40:50
Message-ID: 7d250899c8f2fcbf8030db4e778f39033b94c4ee.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-docs pgsql-hackers

On Thu, 2020-07-23 at 19:33 -0700, Peter Geoghegan wrote:
> That does make it sound like the costs of the hash agg aren't being
> represented. I suppose it isn't clear if this is a costing issue
> because it isn't clear if the execution time performance itself is
> pathological or is instead something that must be accepted as the
> cost
> of spilling the hash agg in a general kind of way.

I have a feeling that this is mostly a costing problem. Sort uses its
memory in two different phases:

1. when writing the sorted runs, it needs the memory to hold the run
before sorting it, and only a single buffer for the output tape; and
2. when merging, it needs a lot of read buffers

But in HashAgg, it needs to hold all of the groups in memory *at the
same time* as it needs a lot of output buffers (one for each
partition). This doesn't matter a lot at high values of work_mem,
because the buffers will only be 8MB at most.

I did attempt to cost this properly: hash_agg_set_limits() takes into
account the memory the partitions will use, and the remaining memory is
what's used in cost_agg(). But there's a lot of room for error in
there.

If someone sees an obvious error in the costing, please let me know.
Otherwise, I think it will just take some time to make it better
reflect reality in a variety of cases. For v13, and we will either need
to live with it, or pessimize the costing for HashAgg until we get it
right.

Many costing issues can deal with a lot of slop -- e.g. HashJoin vs
MergeJoin -- because a small factor often doesn't make the difference
between plans. But HashAgg and Sort are more competitive with each
other, so costing needs to be more precise.

Regards,
Jeff Davis

In response to

Browse pgsql-docs by date

  From Date Subject
Next Message Peter Geoghegan 2020-07-25 18:05:02 Re: Default setting for enable_hashagg_disk
Previous Message Jeff Davis 2020-07-25 17:23:43 Re: Default setting for enable_hashagg_disk

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2020-07-25 17:49:10 Re: Difference for Binary format vs Text format for client-server communication
Previous Message Jeff Janes 2020-07-25 17:30:22 Re: estimation problems for DISTINCT ON with FDW