Re: Hash support for grouping sets

From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Mark Dilger <hornschnorter(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash support for grouping sets
Date: 2017-03-23 20:07:38
Message-ID: 87tw6j94y7.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>>>>> "Mark" == Mark Dilger <hornschnorter(at)gmail(dot)com> writes:

Mark> Is there a performance test case where this patch should shine
Mark> brightest? I'd like to load a schema with lots of data, and run
Mark> a grouping sets query, both before and after applying the patch,
Mark> to see what the performance advantage is.

The area which I think is most important for performance is the handling
of small cubes; without this patch, a 2d cube needs 2 full sorts, a 3d
one needs 3, and a 4d one needs 6. In many real-world data sets these
would all hash entirely in memory.

So here's a very simple example (deliberately using integers for
grouping to minimize the advantage; grouping by text columns in a non-C
locale would show a much greater speedup for the patch):

create table sales (
id serial,
product_id integer,
store_id integer,
customer_id integer,
qty integer);

-- random integer function
create function d(integer) returns integer language sql
as $f$ select floor(random()*$1)::integer + 1; $f$;

-- 10 million random rows
insert into sales (product_id,store_id,customer_id,qty)
select d(20), d(6), d(10), d(100) from generate_series(1,10000000);

-- example 2d cube:
select product_id, store_id, count(*), sum(qty)
from sales
group by cube(product_id, store_id);

-- example 3d cube:
select product_id, store_id, customer_id, count(*), sum(qty)
from sales
group by cube(product_id, store_id, customer_id);

-- example 4d cube with a computed column:
select product_id, store_id, customer_id, (qty / 10), count(*), sum(qty)
from sales
group by cube(product_id, store_id, customer_id, (qty / 10));

On my machine, the 2d cube is about 3.6 seconds with the patch, and
about 8 seconds without it; the 4d is about 18 seconds with the patch
and about 32 seconds without it (all with work_mem=1GB, compiled with
-O2 and assertions off).

--
Andrew (irc:RhodiumToad)

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Teodor Sigaev 2017-03-23 20:08:52 Re: Potential data loss of 2PC files
Previous Message Mithun Cy 2017-03-23 19:52:34 Re: [POC] A better way to expand hash indexes.