Hash support for grouping sets

From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Hash support for grouping sets
Date: 2017-01-06 03:48:58
Message-ID: 87vatszyhj.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Herewith a patch for doing grouping sets via hashing or mixed hashing
and sorting.

The principal objective is to pick whatever combination of grouping sets
has an estimated size that fits in work_mem, and minimizes the number of
sorting passes we need to do over the data, and hash those. (Yes, this
is a knapsack problem.)

This is achieved by adding another strategy to the Agg node; AGG_MIXED
means that both hashing and (sorted or plain) grouping happens. In
addition, support for multiple hash tables in AGG_HASHED mode is added.

Sample plans look like this:

explain select two,ten from onek group by cube(two,ten);
QUERY PLAN
--------------------------------------------------------------
MixedAggregate (cost=0.00..89.33 rows=33 width=8)
Hash Key: two, ten
Hash Key: two
Hash Key: ten
Group Key: ()
-> Seq Scan on onek (cost=0.00..79.00 rows=1000 width=8)

explain select two,ten from onek group by two, cube(ten,twenty);
QUERY PLAN
---------------------------------------------------------------
HashAggregate (cost=86.50..100.62 rows=162 width=12)
Hash Key: two, ten, twenty
Hash Key: two, ten
Hash Key: two
Hash Key: two, twenty
-> Seq Scan on onek (cost=0.00..79.00 rows=1000 width=12)

set work_mem='64kB';
explain select count(*) from tenk1
group by grouping sets (unique1,thousand,hundred,ten,two);
QUERY PLAN
------------------------------------------------------------------------
MixedAggregate (cost=1535.39..3023.89 rows=11112 width=28)
Hash Key: two
Hash Key: ten
Hash Key: hundred
Group Key: unique1
Sort Key: thousand
Group Key: thousand
-> Sort (cost=1535.39..1560.39 rows=10000 width=20)
Sort Key: unique1
-> Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=20)
(10 rows)

Columns with unhashable or unsortable data types are handled
appropriately, though obviously every individual grouping set must end
up containing compatible column types.

There is one current weakness which I don't see a good solution for: the
planner code still has to pick a single value for group_pathkeys before
planning the input path. This means that we sometimes can't choose a
minimal set of sorts, because we may have chosen a sort order for a
grouping set that should have been hashed, for example:

explain select count(*) from tenk1
group by grouping sets (two,unique1,thousand,hundred,ten);
QUERY PLAN
------------------------------------------------------------------------
MixedAggregate (cost=1535.39..4126.28 rows=11112 width=28)
Hash Key: ten
Hash Key: hundred
Group Key: two
Sort Key: unique1
Group Key: unique1
Sort Key: thousand
Group Key: thousand
-> Sort (cost=1535.39..1560.39 rows=10000 width=20)
Sort Key: two
-> Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=20)

(3 sorts, vs. 2 in the previous example that listed the same grouping
sets in different order)

Current status of the work: probably still needs cleanup, maybe some
better regression tests, and Andres has expressed some concern over the
performance impact of additional complexity in advance_aggregates; it
would be useful if this could be tested. But it should be reviewable
as-is.

--
Andrew (irc:RhodiumToad)

Attachment Content-Type Size
gshash10.patch text/x-patch 140.7 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jim Nasby 2017-01-06 03:50:52 Re: Faster methods for getting SPI results
Previous Message Tsunakawa, Takayuki 2017-01-06 03:01:57 Re: Supporting huge pages on Windows