Re: Memory-Bounded Hash Aggregation

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Taylor Vesely <tvesely(at)pivotal(dot)io>, Adam Lee <ali(at)pivotal(dot)io>, Melanie Plageman <mplageman(at)pivotal(dot)io>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory-Bounded Hash Aggregation
Date: 2020-01-25 01:01:35
Message-ID: e5566f7def33a9e9fdff337cca32d07155d7b635.camel@j-davis.com
Lists: pgsql-hackers

On Wed, 2020-01-08 at 12:38 +0200, Heikki Linnakangas wrote:
> This makes the assumption that all Aggrefs or GroupingFuncs are at the
> top of the TargetEntry. That's not true, e.g.:
>
> select 0+sum(a) from foo group by b;
>
> I think find_aggregated_cols() and find_unaggregated_cols() should be
> merged into one function that scans the targetlist once, and returns two
> Bitmapsets. They're always used together, anyway.

I cut the projection out for now, because there's some work in that
area in another thread[1]. If that work doesn't pan out, I can
reintroduce the projection logic to this one.
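
For reference, the single-scan idea quoted above can be sketched roughly as follows. This is a toy Python model, not code from the patch: the node classes (Var, Const, Aggref, OpExpr), the function name find_cols, and the use of Python sets in place of Bitmapsets are all assumptions made for illustration. The point is only that the walker has to descend into expressions, because an aggregate such as sum(a) can sit below an operator like 0+.

```python
from dataclasses import dataclass


@dataclass
class Var:
    colno: int        # column number in the input relation (hypothetical)


@dataclass
class Const:
    value: object


@dataclass
class Aggref:
    name: str
    args: list        # argument expressions of the aggregate


@dataclass
class OpExpr:
    op: str
    args: list        # operand expressions


def find_cols(targetlist):
    """Walk each target expression once and return two sets:
    (columns referenced inside an aggregate, columns referenced outside one)."""
    aggregated, unaggregated = set(), set()

    def walk(node, in_agg):
        if isinstance(node, Var):
            (aggregated if in_agg else unaggregated).add(node.colno)
        elif isinstance(node, Aggref):
            # Everything below an aggregate counts as aggregated.
            for arg in node.args:
                walk(arg, True)
        elif isinstance(node, OpExpr):
            # Recurse: the aggregate need not be at the top of the entry.
            for arg in node.args:
                walk(arg, in_agg)
        # Const nodes reference no columns.

    for expr in targetlist:
        walk(expr, False)
    return aggregated, unaggregated


# select 0+sum(a) from foo group by b;   (assuming a is column 1)
targetlist = [OpExpr("+", [Const(0), Aggref("sum", [Var(1)])])]
agg_cols, unagg_cols = find_cols(targetlist)
```

Here agg_cols contains column 1 even though the Aggref is nested under the + operator, and unagg_cols is empty.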

New patch attached.

It now uses logtape.c (thanks Adam for prototyping this work) instead
of buffile.c. This gives better control over the number of files and
the memory consumed for buffers, and reduces waste. It requires two
changes to logtape.c though:
* add an API to extend the number of tapes
* lazily allocate buffers for reading (buffers for writing were
  already allocated lazily) so that the total number of buffers needed
  at any time is bounded
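
To illustrate the two changes, here is a minimal sketch in Python. It is a toy model and not the logtape.c API: the class and method names (TapeSet, extend, rewind_for_read, buffers_in_use) and the 8192-byte buffer size are assumptions for illustration. What it shows is that a tape holds a buffer only while it is actively being written or read, so the number of live buffers does not grow with the number of tapes.

```python
BLOCK_SIZE = 8192  # assumed buffer size, for illustration only


class Tape:
    def __init__(self):
        self.blocks = []    # stand-in for data spilled to disk
        self.pos = 0        # read position
        self.buffer = None  # allocated on first write or first read


class TapeSet:
    def __init__(self, ntapes):
        self.tapes = [Tape() for _ in range(ntapes)]

    def extend(self, nadditional):
        """Add tapes to an existing set; no buffers are allocated here."""
        self.tapes.extend(Tape() for _ in range(nadditional))
        return len(self.tapes)

    def write(self, tapenum, data):
        tape = self.tapes[tapenum]
        if tape.buffer is None:          # lazy allocation for writing
            tape.buffer = bytearray(BLOCK_SIZE)
        tape.blocks.append(data)

    def rewind_for_read(self, tapenum):
        """Finish writing: drop the write buffer and defer the read buffer."""
        tape = self.tapes[tapenum]
        tape.buffer = None
        tape.pos = 0

    def read(self, tapenum):
        tape = self.tapes[tapenum]
        if tape.pos >= len(tape.blocks):
            tape.buffer = None           # exhausted: release the buffer
            return None
        if tape.buffer is None:          # lazy allocation for reading
            tape.buffer = bytearray(BLOCK_SIZE)
        data = tape.blocks[tape.pos]
        tape.pos += 1
        return data

    def buffers_in_use(self):
        return sum(t.buffer is not None for t in self.tapes)
```

In this model, writing one tape at a time and rewinding each leaves buffers_in_use() at zero, and reading a single tape raises it to one, no matter how many tapes extend() has added.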

Unfortunately, I'm seeing some bad behavior (at least in some cases)
with logtape.c, where it's spending a lot of time qsorting the list of
free blocks. Adam, did you also see this during your perf tests? It
seems to be worst with lower work_mem settings and a large number of
input groups (perhaps there are just too many small tapes?).
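
A toy model of how the sorting cost can add up, assuming the free list is an array that is re-sorted whenever a block is requested after an out-of-order release. That behavior is an assumption of this sketch, not a quote of logtape.c, and the FreeBlockList class and its nsorts counter are hypothetical names. If releases and requests interleave, as they might with many small tapes, nearly every request pays for a full sort of the list.

```python
class FreeBlockList:
    """Free block numbers kept in decreasing order, sorted lazily."""

    def __init__(self):
        self.blocks = []
        self.sorted = True
        self.nsorts = 0   # counts full sorts, for demonstration only

    def release(self, blocknum):
        # Appending a block larger than the current tail breaks the
        # decreasing order, so the next request has to re-sort.
        if self.blocks and self.blocks[-1] < blocknum:
            self.sorted = False
        self.blocks.append(blocknum)

    def get(self):
        """Return the lowest-numbered free block, or None if there is none."""
        if not self.blocks:
            return None
        if not self.sorted:
            self.blocks.sort(reverse=True)
            self.sorted = True
            self.nsorts += 1
        return self.blocks.pop()


def interleaved_sorts(nfree, nrounds):
    """Release nfree blocks in order, then alternate release/get nrounds times."""
    fl = FreeBlockList()
    for b in range(nfree, 0, -1):
        fl.release(b)             # already in decreasing order: no sort needed
    for i in range(nrounds):
        fl.release(nfree + 1 + i)  # out of order: marks the list unsorted
        fl.get()                   # triggers a sort of the whole list
    return fl.nsorts
```

In this model interleaved_sorts(nfree, nrounds) returns nrounds: one full sort per request, each over roughly nfree entries.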

The patch also includes some fairly major refactoring that should make
it simpler to understand and reason about; hopefully I didn't introduce
too many bugs or regressions in the process.

A list of other changes:
* added test that involves rescan
* tweaked some details and tunables so that memory usage tracking and
  reporting (EXPLAIN ANALYZE) should be better, especially for smaller
  work_mem
* simplified quite a few function signatures

Regards,
Jeff Davis

[1]
https://postgr.es/m/CAAKRu_Yj=Q_ZxiGX+pgstNWMbUJApEJX-imvAEwryCk5SLUebg@mail.gmail.com

Attachment: hashagg-20200124.patch (text/x-patch, 117.2 KB)
