Re: Default setting for enable_hashagg_disk

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, David Rowley <dgrowleyml(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Default setting for enable_hashagg_disk
Date: 2020-06-25 23:53:57
Message-ID: 20200625235357.udlest4k6i2uekui@development
Lists: pgsql-docs pgsql-hackers

On Thu, Jun 25, 2020 at 01:17:56PM -0400, Bruce Momjian wrote:
>On Thu, Jun 25, 2020 at 11:46:54AM -0400, Robert Haas wrote:
>> On Wed, Jun 24, 2020 at 7:38 PM Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>> > I think my main point is that work_mem was not being honored for
>> > hash-agg before, but now that PG 13 can do it, we are again allowing
>> > work_mem not to apply in certain cases. I am wondering if our hard
>> > limit for work_mem is the issue, and we should make that more flexible
>> > for all uses.
>>
>> I mean, that's pretty much what we're talking about here, isn't it? It
>> seems like in your previous two replies you were opposed to separating
>> the plan-type limit from the execution-time limit, but that idea is
>> precisely a way of being more flexible (and extending it to other plan
>> nodes is a way of making it more flexible for more use cases).
>
>I think it was Tom who was complaining about plan vs. execution-time
>control.
>
>> As I think you know, if you have a system where the workload varies a
>> lot, you may sometimes be using 0 copies of work_mem and at other
>> times 1000 or more copies, so the value has to be chosen
>> conservatively as a percentage of system memory, else you start
>> swapping or the OOM killer gets involved. On the other hand, some plan
>> nodes get a lot less efficient when the amount of memory available
>> falls below some threshold, so you can't just set this to a tiny value
>> and forget about it. Because the first problem is so bad, most people
>> set the value relatively conservatively and just live with the
>> performance consequences. But this also means that they have memory
>> left over most of the time, so the idea of letting a node burst above
>> its work_mem allocation when something unexpected happens isn't crazy:
>> as long as only a few nodes do that here and there, rather than, say,
>> all the nodes doing it all at the same time, it's actually fine. If we
>> had a smarter system that could dole out more work_mem to nodes that
>> would really benefit from it and less to nodes where it isn't likely
>> to make much difference, that would be similar in spirit but even
>> better.
>
>I think the issue is that in PG 13 work_mem controls sorts and hashes
>with a new hard limit for hash aggregation:
>
> https://www.postgresql.org/docs/12/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY
>
> Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash
> tables are used in hash joins, hash-based aggregation, and hash-based
> processing of IN subqueries.
>
>In pre-PG 13, we "controlled" it by avoiding hash-based aggregation if it
>was expected to exceed work_mem, but if we assumed it would be less than
>work_mem and it turned out to be more, we exceeded the work_mem allocation
>for that node. In PG 13, we "limit" memory to work_mem and spill to disk
>if we exceed it.
>
>We should really have always documented that hash agg could exceed
>work_mem on misestimation, and if we add a hash_agg work_mem
>misestimation bypass setting we should document that setting under
>work_mem as well.
>

I don't think that would change anything, really. For users the
consequences would still be exactly the same, and they wouldn't even be
in a position to check whether they are affected.

So just documenting that hashagg does not respect work_mem at runtime
would be nice, but it would not make any difference for v13, just like
documenting a bug is not really the same thing as fixing it.
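
FWIW the behavior change itself is easy to observe. A rough, untested
sketch on a v13 build (plan choice depends on the estimates, hence the
enable_sort bit just to force the hash plan):

   SET work_mem = '1MB';
   SET enable_sort = off;   -- only to force a hash plan for the demo
   EXPLAIN (ANALYZE, COSTS OFF)
   SELECT g % 100000 AS k, count(*)
   FROM generate_series(1, 1000000) g
   GROUP BY k;

On v13 the HashAggregate node should report batches and disk usage once
it spills, while before v13 the executor would simply keep growing the
hash table past work_mem whenever the planner underestimated the number
of groups.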

>But then the question is why do we allow this bypass only for hash agg?
>Should work_mem have a settings for ORDER BY, merge join, hash join, and
>hash agg, e.g.:
>
> work_mem = 'order_by=10MB, hash_join=20MB, hash_agg=100MB'
>
>Yeah, crazy syntax, but you get the idea. I understand some nodes are
>more sensitive to disk spill than others, so shouldn't we be controlling
>this at the work_mem level, rather than for a specific node type like
>hash agg? We could allow for misestimation over allocation of hash agg
>work_mem by splitting up the hash agg values:
>
>	work_mem = 'order_by=10MB, hash_join=20MB, hash_agg=100MB, hash_agg_max=200MB'
>
>but _avoiding_ hash agg if it is estimated to exceed work_mem and spill
>to disk is not something to logically control at the work_mem level,
>which leads to something like what David Rowley suggested, but with
>different names:
>
> enable_hashagg = on | soft | avoid | off
>
>where 'on' and 'off' are the current PG 13 behavior, 'soft' means to
>treat work_mem as a soft limit and allow hash agg to exceed work_mem on
>misestimation, and 'avoid' means to avoid hash agg if it is estimated to
>exceed work_mem. Neither 'soft' nor 'avoid' spills to disk.
>
>David's original terms of "trynospill" and "neverspill" were focused on
>spilling, not on its interaction with work_mem, and I found that
>confusing.
>
>Frankly, if it took me this long to get my head around this, I am not
>sure how many people will understand this tuning feature well enough to
>actually use it.
>

Yeah. I agree with Andres that this may be a real issue, and that adding
some sort of "escape hatch" for v13 would be good. But I'm not convinced
adding a whole lot of new memory limits for every node that might spill
is the way to go. What exactly would our tuning advice to users be? Of
course, we could keep them all set to work_mem by default, but we all
know engineers - we can't resist tuning a knob when we get one.
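
To be concrete, I guess that would mean a bunch of knobs roughly like
these (purely hypothetical names, none of them exist today):

   SET sort_work_mem = '16MB';        -- sorts / ORDER BY
   SET hash_join_work_mem = '32MB';   -- hash joins
   SET hash_agg_work_mem = '64MB';    -- hash aggregation
   SET hash_agg_max_mem = '128MB';    -- hard ceiling on misestimation

and every one of them would need its own tuning advice.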

I'm not saying it's not beneficial to use different limits for different
nodes. Some nodes are less sensitive to the memory limit (e.g. sorting
often gets faster with smaller work_mem). But I think we should instead
have a per-session limit, and the planner should "distribute" that
memory to the different nodes. It's a hard problem, of course.
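
Roughly something like this, just to illustrate the idea (entirely
hypothetical - work_mem is real, session_memory_limit is not):

   SET work_mem = '32MB';               -- per-operation default, as today
   SET session_memory_limit = '512MB';  -- overall budget the planner would
                                        -- "distribute" among the nodes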

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
