Re: Default setting for enable_hashagg_disk

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Default setting for enable_hashagg_disk
Date: 2020-06-10 18:39:02
Message-ID: 8bff2e4e8020c3caa16b61a46918d21b573eaf78.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-docs pgsql-hackers

On Tue, 2020-06-09 at 21:15 -0500, Justin Pryzby wrote:
> The behavior of the GUC is inconsistent with the other GUCs, which is
> confusing. See also Robert's comments in this thread.
>
https://www.postgresql.org/message-id/20200407223900.GT2228%40telsasoft.com

enable_* GUCs are planner GUCs, so it would be confusing to me if they
affected execution-time behavior.

I think the point of confusion is that it's not enabling/disabling an
entire execution node; it only "disables" HashAgg if it thinks it will
spill. I agree that is a difference with the other GUCs, and could
cause confusion.

Stepping back, I was trying to solve two problems with these GUCs:

1. Testing the spilling of hashed grouping sets: I'm inclined to just
get rid of enable_groupingsets_hash_disk and use Melanie's stats-
hacking approach instead.

2. Trying to provide an escape hatch for someone who experiences a
performance regression and wants something like the old behavior back.
There are two aspects of the old behavior that a user could potentially
want back:
a. Don't choose HashAgg if it's expected to have more groups than fit
into a work_mem-sized hashtable.
b. If executing HashAgg, and the hash table exceeds work_mem, just
keep going.

The behavior in v13 master is, by default, analagous to Sort or
anything else that adapts at runtime to spill. If we had spillable
HashAgg the whole time, we wouldn't be worried about #2 at all. But,
out of conservatism, I am trying to accommodate users who want an
escape hatch, at least for a release or two until users feel more
comfortable with disk-based HashAgg.

Setting enable_hash_disk=false implements 2(a). This name apparently
causes confusion, but it's hard to come up with a better one because
the v12 behavior has nuance that's hard to express succinctly. I don't
think the names you suggested quite fit, but the idea to use a more
interesting GUC value might help express the behavior. Perhaps making
enable_hashagg a ternary "enable_hashagg=on|off|avoid_disk"? The word
"reject" is too definite for the planner, which is working with
imperfect information.

In master, there is no explicit way to get 2(b), but you can just set
work_mem higher in a lot of cases. If enough people want 2(b), I can
add it easily. Perhaps hashagg_overflow=on|off, which would control
execution time behavior?

Regards,
Jeff Davis

In response to

Responses

Browse pgsql-docs by date

  From Date Subject
Next Message Jeff Davis 2020-06-10 18:52:22 Re: Default setting for enable_hashagg_disk
Previous Message Jeff Davis 2020-06-10 17:39:08 Re: Default setting for enable_hashagg_disk

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2020-06-10 18:52:22 Re: Default setting for enable_hashagg_disk
Previous Message Alexey Kondratov 2020-06-10 17:57:17 Re: Physical replication slot advance is not persistent