Re: Memory-Bounded Hash Aggregation

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Taylor Vesely <tvesely(at)pivotal(dot)io>, Adam Lee <ali(at)pivotal(dot)io>, Melanie Plageman <mplageman(at)pivotal(dot)io>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Memory-Bounded Hash Aggregation
Date: 2019-12-05 01:24:00
Message-ID: CAAKRu_YgA1rEND4mb_3e2980vRY_Q5Y23o9FfMB1MrKGLgvmiw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 28, 2019 at 9:47 AM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> On Wed, Nov 27, 2019 at 02:58:04PM -0800, Jeff Davis wrote:
> >On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote:
> >> Right now the patch always initializes 32 spill partitions. Have you
> >> given
> >> any thought into how to intelligently pick an optimal number of
> >> partitions yet?
> >
> >Attached a new patch that addresses this.
> >
> >1. Divide hash table memory used by the number of groups in the hash
> >table to get the average memory used per group.
> >2. Multiply by the number of groups spilled -- which I pessimistically
> >estimate as the number of tuples spilled -- to get the total amount of
> >memory that we'd like to have to process all spilled tuples at once.
>
> Isn't the "number of tuples = number of groups" estimate likely to be
> way too pessimistic? IIUC the consequence is that it pushes us to pick
> more partitions than necessary, correct?

> Could we instead track how many tuples we actually consumed for the
> in-memory groups, and then use this information to improve the estimate
> of number of groups? I mean, if we know we've consumed 1000 tuples which
> created 100 groups, then we know there's ~1:10 ratio.
>

What would the cost be of having many small partitions? Some of the
spill files created may not be used if the estimate was pessimistic,
but that seems better than the alternative of re-spilling, since every
spill writes every tuple again.
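
To make that trade-off concrete, here is a rough, self-contained sketch of
the kind of calculation described in the quoted steps, comparing the
pessimistic "every spilled tuple is a new group" estimate with the
observed-ratio estimate suggested above. The names, numbers, and exact
formula are illustrative assumptions, not the patch's actual code:

#include <stdio.h>

/* Pick enough partitions that each partition's groups should fit in work_mem. */
static int
choose_num_partitions(double spilled_groups, double mem_per_group,
                      double work_mem_bytes)
{
    double      mem_wanted = spilled_groups * mem_per_group;

    return (int) (mem_wanted / work_mem_bytes) + 1;
}

int
main(void)
{
    double      work_mem = 4.0 * 1024 * 1024;          /* 4MB */
    double      hashtable_mem = 4.0 * 1024 * 1024;     /* memory used at spill time */
    double      groups_in_table = 40000;
    double      mem_per_group = hashtable_mem / groups_in_table;   /* ~105 bytes */
    double      tuples_spilled = 1000000;

    /* Pessimistic: assume every spilled tuple starts a new group. */
    int         n_pessimistic = choose_num_partitions(tuples_spilled,
                                                      mem_per_group, work_mem);

    /* Ratio-based: scale down by the observed tuples-per-group ratio (10:1 here). */
    int         n_ratio = choose_num_partitions(tuples_spilled / 10.0,
                                                mem_per_group, work_mem);

    printf("pessimistic: %d partitions, ratio-based: %d partitions\n",
           n_pessimistic, n_ratio);
    return 0;
}

With these made-up numbers the pessimistic estimate asks for many times
more partitions than the ratio-based one, which is the gap behind my
question: extra, possibly unused spill files on one side versus the risk
of re-spilling on the other.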

Also, the "number of groups = number of tuples" estimate is only used
when re-spilling. That is a little unclear from the variable naming.

It looks like the parameter input_tuples passed to hash_spill_init()
in lookup_hash_entries() is the number of groups estimated by the planner.
However, when reloading a spill file, if we run out of memory and
re-spill, hash_spill_init() is passed batch->input_groups (which is
actually set from input_ngroups, the number of tuples in the spill
file). So input_tuples actually receives a group count and input_groups
actually receives a tuple count. It may be helpful to rename these.

>
> 4) I'm not sure I agree with this reasoning that HASH_PARTITION_FACTOR
> making the hash tables smaller is desirable - it may be, but if that was
> generally the case we'd just use small hash tables all the time. It's a
> bit annoying to give user the capability to set work_mem and then kinda
> override that.
>
> * ... Another benefit of having more, smaller partitions is that small
> * hash tables may perform better than large ones due to memory caching
> * effects.
>
>
So, it looks like HASH_PARTITION_FACTOR is only used when re-spilling;
the initial hash table will just use work_mem.
It seems like the reason for using it when re-spilling is to be very
conservative: avoid more than one re-spill and make sure each spill
file fits in an in-memory hash table.
The comment does seem to point to some other reason, though...
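
If the "be conservative" reading is right, the net effect would be roughly
what the sketch below shows: inflating the memory estimate pushes the
partition count up, so each re-spilled partition is expected to land
comfortably under work_mem. The factor's value and exactly how it is
applied here are assumptions for illustration, not taken from the patch:

#include <stdio.h>

#define HASH_PARTITION_FACTOR 1.5      /* assumed value, for illustration only */

int
main(void)
{
    double      work_mem = 4.0 * 1024 * 1024;      /* 4MB */
    double      mem_wanted = 40.0 * 1024 * 1024;   /* estimated memory for all spilled groups */

    int         plain = (int) (mem_wanted / work_mem) + 1;
    int         padded = (int) ((mem_wanted * HASH_PARTITION_FACTOR) / work_mem) + 1;

    /*
     * With the factor, each partition is expected to need only about
     * work_mem / HASH_PARTITION_FACTOR, leaving headroom so that a
     * re-spilled file is unlikely to overflow and spill yet again.
     */
    printf("without factor: %d partitions (~%.1f MB expected each)\n",
           plain, mem_wanted / plain / (1024.0 * 1024));
    printf("with factor:    %d partitions (~%.1f MB expected each)\n",
           padded, mem_wanted / padded / (1024.0 * 1024));
    return 0;
}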

>
> 11) The hash_spill_npartitions naming seems a bit confusing, because it
> seems to imply it's about the "spill" while in practice it just chooses
> number of spill partitions. Maybe hash_choose_num_spill_partitions would
> be better?
>
>
Agreed that a name with "choose" or "calculate" as the verb would be
clearer.

>
> 12) It's not clear to me why we need HASH_MAX_PARTITIONS? What's the
> reasoning behind the current value (256)? Not wanting to pick too many
> partitions? Comment?
>
> if (npartitions > HASH_MAX_PARTITIONS)
> npartitions = HASH_MAX_PARTITIONS;
>
>
256 actually seems very large. hash_spill_npartitions() will be called
for every re-spill, so HASH_MAX_PARTITIONS is not the total number of
spill files permitted; it is actually the maximum number of spill
files for a given spill (a spill set). So if you made X partitions
initially and every partition re-spills, you would now have (at most)
X * 256 partitions.
If HASH_MAX_PARTITIONS is 256, wouldn't the metadata for the spill
files take up a lot of memory at that point?
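
As a back-of-the-envelope illustration of that concern, assuming roughly
one 8kB (BLCKSZ-sized) BufFile buffer per open spill file and 32 initial
partitions just as an example:

#include <stdio.h>

#define HASH_MAX_PARTITIONS 256

int
main(void)
{
    int         initial_partitions = 32;       /* example starting point */
    int         files = initial_partitions * HASH_MAX_PARTITIONS;
    double      buffer_per_file = 8192.0;      /* one BLCKSZ-sized BufFile buffer */

    /* 32 * 256 = 8192 spill files, ~64MB of buffers if all were open at once */
    printf("%d spill files, ~%.0f MB of file buffers if all are open\n",
           files, files * buffer_per_file / (1024.0 * 1024));
    return 0;
}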

Melanie & Adam Lee
