Re: Memory-Bounded Hash Aggregation

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Taylor Vesely <tvesely(at)pivotal(dot)io>, Adam Lee <ali(at)pivotal(dot)io>, Melanie Plageman <mplageman(at)pivotal(dot)io>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Memory-Bounded Hash Aggregation
Date: 2020-02-14 02:01:38
Message-ID: CAAKRu_aefEsv+UkQWqu+ioEnoiL2LJu9Diuh9BR8MbyXuZ0j4A@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 8, 2020 at 2:38 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

> This makes the assumption that all Aggrefs or GroupingFuncs are at the
> top of the TargetEntry. That's not true, e.g.:
>
> select 0+sum(a) from foo group by b;
>
> I think find_aggregated_cols() and find_unaggregated_cols() should be
> merged into one function that scans the targetlist once, and returns two
> Bitmapsets. They're always used together, anyway.
>
>
So, I've attached a patch that does what Heikki recommended: it collects
both the aggregated and the unaggregated columns in two different
bitmapsets in a single pass.
I think it handles more cases than the previous patch did.
I'm not sure it is the ideal interface, but since there aren't many
consumers, it's hard to say.
It also still needs some cleanup (formatting, better names, etc.).
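
To give an idea of the shape of it (this is just a sketch of the approach,
not the attached patch itself, and the names are made up), it is a single
expression_tree_walker() pass that flips a flag while it is below an
Aggref or GroupingFunc:

#include "postgres.h"

#include "nodes/bitmapset.h"
#include "nodes/nodeFuncs.h"    /* expression_tree_walker() */
#include "nodes/primnodes.h"

/* sketch: collect aggregated and unaggregated columns in one pass */
typedef struct FindColsContext
{
    bool        is_aggref;      /* currently below an Aggref/GroupingFunc? */
    Bitmapset  *aggregated;     /* columns referenced inside aggregates */
    Bitmapset  *unaggregated;   /* columns referenced anywhere else */
} FindColsContext;

static bool
find_cols_walker(Node *node, FindColsContext *context)
{
    if (node == NULL)
        return false;

    if (IsA(node, Var))
    {
        Var        *var = (Var *) node;

        if (context->is_aggref)
            context->aggregated = bms_add_member(context->aggregated,
                                                 var->varattno);
        else
            context->unaggregated = bms_add_member(context->unaggregated,
                                                   var->varattno);
        return false;
    }

    if (IsA(node, Aggref) || IsA(node, GroupingFunc))
    {
        /* every Var below here is aggregated, however deeply nested */
        bool        result;

        context->is_aggref = true;
        result = expression_tree_walker(node, find_cols_walker,
                                        (void *) context);
        context->is_aggref = false;
        return result;
    }

    return expression_tree_walker(node, find_cols_walker, (void *) context);
}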

Per Jeff's comment in [1] I started looking into using the scanCols
patch from the thread on extracting scanCols from PlannerInfo [2] to
get the aggregated and unaggregated columns for this patch.

Since scanCols is a single bitmap containing all of the columns that need
to be scanned, it carries no context about where those columns came from
in the query.
That is, once a bit is set in the bitmapset, we have no way of knowing
whether that column was needed for aggregation or whether it is filtered
out immediately.
In Heikki's example above, for instance, scanCols would contain both a and
b, with nothing to indicate that only a is referenced inside an aggregate.

We could solve this by creating multiple bitmaps at the time that we
create the scanCols field -- one for aggregated columns, one for
unaggregated columns, and potentially more if they are useful to other
consumers.
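
Just to illustrate what I mean (these names are completely made up, and I
haven't thought about exactly which node the field would live on), the
single scanCols set would become something like:

/* hypothetical replacement for a single scanCols bitmapset */
typedef struct ScanColsSets
{
    Bitmapset  *all;            /* every column that must be scanned */
    Bitmapset  *aggregated;     /* columns referenced inside aggregates */
    Bitmapset  *unaggregated;   /* columns referenced elsewhere in the query */
} ScanColsSets;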

The initial problem with this is that we extract scanCols from
PlannerInfo->simple_rel_array and PlannerInfo->simple_rte_array.
If we wanted more context about where those columns came from in the
query, we would have to either change how we construct scanCols or
construct the bitmaps earlier and add to them as columns are added to the
simple_rel_array and simple_rte_array (which, I suppose, is the same thing
as changing how we construct scanCols).

This might decentralize the code for the benefit of one consumer.
Also, looping through the simple_rel_array and simple_rte_array a couple
of times per query seems like it would add negligible overhead (a rough
sketch of what I mean is below).
What I'm more hesitant about is adding code (most likely a walker) to the
codepath everybody uses if only agg will leverage the two distinct
bitmapsets.
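
For reference, that loop would look something like this (not from the
patch, just sketched against the existing PlannerInfo fields):

/* an extra pass over the per-relation arrays should be cheap */
for (int rti = 1; rti < root->simple_rel_array_size; rti++)
{
    RelOptInfo *rel = root->simple_rel_array[rti];
    RangeTblEntry *rte = root->simple_rte_array[rti];

    if (rel == NULL || rte->rtekind != RTE_RELATION)
        continue;

    /* ... add this rel's columns to the appropriate bitmapset(s) ... */
}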

Overall, it seems like a good idea to leverage scanCols to determine which
columns hashagg needs to spill, but I can't think of
a way of doing it that doesn't seem bad. scanCols are currently just
that -- columns that will need to be scanned.

[1]
https://www.postgresql.org/message-id/e5566f7def33a9e9fdff337cca32d07155d7b635.camel%40j-davis.com
[2]
https://www.postgresql.org/message-id/flat/CAAKRu_Yj%3DQ_ZxiGX%2BpgstNWMbUJApEJX-imvAEwryCk5SLUebg%40mail.gmail.com

--
Melanie Plageman

Attachment Content-Type Size
v1-0001-aggregated-unaggregated-cols-together.patch application/octet-stream 4.8 KB
