Re: Relids in upper relations

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relids in upper relations
Date: 2016-10-09 23:26:51
Message-ID: CAKJS1f_=1jBrc+DzAomNdDvwvv+SOHBqee4xB6m4OugSrHtDQA@mail.gmail.com
Lists: pgsql-hackers

On 10 October 2016 at 11:33, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> SELECT order_line.order_id, order.customer_id, SUM(order_line.amount)
>> FROM order_line, order WHERE order_line.order_id = order.order_id
>> GROUP BY 1,2;
>
>> Doing the aggregation step first is likely to be much faster than
>> doing the join first here,
>
> Please provide some reason to believe that. It's the nature of an
> aggregate that it's sensitive to the number of rows going through it,
> with only a tiny number of exceptions (and SUM ain't one). So you could
> only push it down past joins that won't change the number of rows the
> aggregate will process, and how is that going to make it any faster?

I think there's a flaw in Robert's example, because the GROUP BY contains
columns from both sides of the join. However, it's very simple to come up
with a real-world case where aggregate push-down helps. We can simulate the
push-down with a simple subquery.

create table sale (sale_id int primary key, product_id int not null, quantity int not null);
create table product (product_id int primary key, productdesc varchar(32) not null);

insert into sale select x, x%100+1, (random()*10)::int+1 from generate_series(1,10000000) x(x);
insert into product select x, 'ABC'||x::text from generate_series(1,100) x(x);

postgres=# explain (analyze, timing off)
select p.product_id, p.productdesc, sum(s.quantity) qty
from sale s
inner join product p on s.product_id = p.product_id
group by p.product_id;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=341558.25..341559.25 rows=100 width=17) (actual rows=100 loops=1)
   Group Key: p.product_id
   ->  Hash Join  (cost=3.25..291558.25 rows=10000000 width=13) (actual rows=10000000 loops=1)
         Hash Cond: (s.product_id = p.product_id)
         ->  Seq Scan on sale s  (cost=0.00..154055.00 rows=10000000 width=8) (actual rows=10000000 loops=1)
         ->  Hash  (cost=2.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 13kB
               ->  Seq Scan on product p  (cost=0.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
 Planning time: 0.308 ms
 Execution time: 7568.927 ms
(10 rows)

Time: 7569.842 ms (00:07.570)

postgres=# explain (analyze, timing off)
select p.product_id, p.productdesc, s.qty
from (select product_id, sum(quantity) qty from sale group by product_id) s
inner join product p on s.product_id = p.product_id;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Hash Join  (cost=204058.25..204061.63 rows=100 width=17) (actual rows=100 loops=1)
   Hash Cond: (sale.product_id = p.product_id)
   ->  HashAggregate  (cost=204055.00..204056.00 rows=100 width=12) (actual rows=100 loops=1)
         Group Key: sale.product_id
         ->  Seq Scan on sale  (cost=0.00..154055.00 rows=10000000 width=8) (actual rows=10000000 loops=1)
   ->  Hash  (cost=2.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 13kB
         ->  Seq Scan on product p  (cost=0.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
 Planning time: 0.610 ms
 Execution time: 4267.145 ms
(10 rows)

So the pushed-down version runs in 56% of the time of the non-pushed-down
version. Of course, as Robert mentioned, both versions would have to be
costed, since the join condition could be highly selective.
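For illustration, here's a hypothetical variant (the predicate and value are made up for this sketch, not from the test above) where push-down would likely lose: the join effectively keeps only one product, so aggregating all 10 million sale rows first would mostly be wasted work, and filtering/joining first should win.

```sql
-- Hypothetical selective case: only one product survives the join,
-- so pre-aggregating the whole sale table before the join would
-- compute ~99 groups that are immediately discarded.
explain (analyze, timing off)
select p.product_id, p.productdesc, sum(s.quantity) qty
from sale s
inner join product p on s.product_id = p.product_id
where p.productdesc = 'ABC1'      -- highly selective predicate
group by p.product_id, p.productdesc;
```

This is exactly why the planner would need to cost both orderings rather than always pushing the aggregate below the join.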

There's a good paper which goes into a bit more detail on this:
http://www.vldb.org/conf/1995/P345.PDF

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
