Re: Memory-Bounded Hash Aggregation

From: Richard Guo <guofenglinux(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Taylor Vesely <tvesely(at)pivotal(dot)io>, Adam Lee <ali(at)pivotal(dot)io>, Melanie Plageman <mplageman(at)pivotal(dot)io>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Memory-Bounded Hash Aggregation
Date: 2020-03-26 09:56:56
Message-ID: CAMbWs4_W8fYbAn8KxgidAaZHON_Oo08OYn9ze=7remJymLqo5g@mail.gmail.com
Lists: pgsql-hackers

Hello,

When calculating the disk costs of hash aggregation that spills to disk,
there is something wrong with how we determine the depth:

> depth = ceil( log(nbatches - 1) / log(num_partitions) );

If nbatches is some number between 1.0 and 2.0, then 'nbatches - 1' falls
in (0.0, 1.0) and its logarithm is negative, so we end up with a negative
depth. As a result, we may get a negative cost for the hash aggregation
plan node, as described in [1].

I don't think 'log(nbatches - 1)' is what we want here. Should it be
just '(nbatches - 1)'?
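
To illustrate with concrete numbers (a minimal standalone C sketch, not
the actual costing code; nbatches = 1.1 and num_partitions = 4 are just
made-up example values, and the depth_alt line only reflects the
'(nbatches - 1)' alternative asked about above, not a confirmed fix):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double nbatches = 1.1;       /* hypothetical value in (1.0, 2.0) */
    double num_partitions = 4.0; /* hypothetical partition fan-out */

    /* current formula: log(0.1) is negative, so depth comes out as -1 */
    double depth = ceil(log(nbatches - 1) / log(num_partitions));

    /* alternative: use (nbatches - 1) directly, which stays >= 0 */
    double depth_alt = ceil((nbatches - 1) / log(num_partitions));

    /* prints: depth = -1, depth_alt = 1 */
    printf("depth = %g, depth_alt = %g\n", depth, depth_alt);
    return 0;
}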

[1]
https://www.postgresql.org/message-id/flat/CAMbWs4_maqdBnRR4x01pDpoV-CiQ%2BRvMQaPm4JoTPbA%3DmZmhMw%40mail.gmail.com

Thanks
Richard

On Thu, Mar 19, 2020 at 7:36 AM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

>
> Committed.
>
> There's some future work that would be nice (some of these are just
> ideas and may not be worth it):
>
> * Refactor MemoryContextMemAllocated() to be a part of
> MemoryContextStats(), but allow it to avoid walking through the blocks
> and freelists.
>
> * Improve the choice of the initial number of buckets in the hash
> table. For this patch, I tried to preserve the existing behavior of
> estimating the number of groups and trying to initialize with that many
> buckets. But my performance tests seem to indicate this is not the best
> approach. More work is needed to find what we should really do here.
>
> * For workloads that are not in work_mem *or* system memory, and need
> to actually go to storage, I see poor CPU utilization because it's not
> effectively overlapping CPU and IO work. Perhaps buffering or readahead
> changes can improve this, or async IO (even better).
>
> * Project unnecessary attributes away before spilling tuples to disk.
>
> * Improve logtape.c API so that the caller doesn't need to manage a
> bunch of tape numbers.
>
> * Improve estimate of the hash entry size. This patch doesn't change
> the way the planner estimates it, but I observe that actual size as
> seen at runtime is significantly different. This is connected to the
> initial number of buckets for the hash table.
>
> * In recursive steps, I don't have a good estimate for the number of
> groups, so I just estimate it as the number of tuples in that spill
> tape (which is pessimistic). That could be improved by doing a real
> cardinality estimate as the tuples are spilling (perhaps with HLL?).
>
> * Many aggregates with pass-by-ref transition states don't provide a
> great aggtransspace. We should consider doing something smarter, like
> having negative numbers represent a number that should be multiplied by
> the size of the group (e.g. ARRAY_AGG would have a size dependent on
> the group size, not a constant).
>
> * If we want to handle ARRAY_AGG (and the like) well, we can consider
> spilling the partial states in the hash table when the memory is full.
> That would add a fair amount of complexity because there would be two
> types of spilled data (tuples and partial states), but it could be
> useful in some cases.
>
> Regards,
> Jeff Davis
>
