Pull up aggregate sublink (was: Parameterized aggregate subquery (was: Pull up aggregate subquery))

From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Pull up aggregate sublink (was: Parameterized aggregate subquery (was: Pull up aggregate subquery))
Date: 2011-07-26 19:32:11
Message-ID: 4E2F163B.6060105@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2011-07-22 17:35, Hitoshi Harada wrote:
> 2011/7/23 Yeb Havinga<yebhavinga(at)gmail(dot)com>:
>> Works like a charm :-). However, now there is always a copyObject of a
>> subquery even when the subquery is not safe for qual pushdown. The problem
>> with the previous issafe was that it was only assigned for
>> rel->baserestrictinfo != NIL. If it is assigned before the if statement, it
>> still works. See attached patch that avoids subquery copy for unsafe
>> subqueries, and also exits best_inner_subqueryscan before palloc of
>> differenttypes in case of unsafe queries.
>
> Ah, yeah, right. Too quick fix bloated my brain :P Thanks for testing!
> I'll check it more.

A few days ago I read Tomas Vondra's blog post about dss tpc-h queries
on PostgreSQL at
http://fuzzy.cz/en/articles/dss-tpc-h-benchmark-with-postgresql/ - in
which he showed how to manually pull up a dss subquery to get a large
speed up. Initially I thought: cool, this is probably now handled by
Hitoshi's patch, but it turns out the subquery type in the dss query is
different.

The original and rewritten queries are below. The debug_print_plan
output shows the subquery is called from a opexpr (< l_quantity,
subquery output) and the sublink type is EXPR_SUBLINK. Looking at the
source code; pull_up_sublink only considers ANY and EXISTS sublinks. I'm
wondering if this could be expanded to deal with EXPR sublinks. Clearly
in the example Tomas has given this can be done. I'm wondering if there
are show stoppers that prevent this to be possible in the general case,
but can't think of any, other than the case of a sublink returning NULL
and the opexpr is part of a larger OR expression or IS NULL; in which
case it should not be pulled op, or perhaps it could be pulled up as
outer join.

Thoughts?

regards,
Yeb

The original query:

tpch=# explain select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem,
part
where
p_partkey = l_partkey
and p_brand = 'Brand#13'
and p_container = 'JUMBO PKG'
and l_quantity < (
select
0.2 * avg(l_quantity)
from
lineitem
where
l_partkey = p_partkey
)
LIMIT 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Limit (cost=183345309.79..183345309.81 rows=1 width=8)
-> Aggregate (cost=183345309.79..183345309.81 rows=1 width=8)
-> Hash Join (cost=2839.99..183345307.76 rows=815 width=8)
Hash Cond: (public.lineitem.l_partkey = part.p_partkey)
Join Filter: (public.lineitem.l_quantity < (SubPlan 1))
-> Seq Scan on lineitem (cost=0.00..68985.69
rows=2399869 width=17)
-> Hash (cost=2839.00..2839.00 rows=79 width=4)
-> Seq Scan on part (cost=0.00..2839.00 rows=79
width=4)
Filter: ((p_brand = 'Brand#13'::bpchar) AND
(p_container = 'JUMBO PKG'::bpchar))
SubPlan 1
-> Aggregate (cost=74985.44..74985.46 rows=1 width=5)
-> Seq Scan on lineitem (cost=0.00..74985.36
rows=31 width=5)
Filter: (l_partkey = part.p_partkey)

manually rewritten variant:

tpch=# explain select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem,
part,
(SELECT l_partkey AS agg_partkey,
0.2 * avg(l_quantity) AS avg_quantity
FROM lineitem GROUP BY l_partkey) part_agg
where
p_partkey = l_partkey
and agg_partkey = l_partkey
and p_brand = 'Brand#13'
and p_container = 'JUMBO PKG'
and l_quantity < avg_quantity
LIMIT 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Limit (cost=179643.88..179643.89 rows=1 width=8)
-> Aggregate (cost=179643.88..179643.89 rows=1 width=8)
-> Hash Join (cost=161865.21..178853.91 rows=315985 width=8)
Hash Cond: (public.lineitem.l_partkey = part.p_partkey)
Join Filter: (public.lineitem.l_quantity < ((0.2 *
avg(public.lineitem.l_quantity))))
-> HashAggregate (cost=80985.04..82148.65 rows=77574
width=9)
-> Seq Scan on lineitem (cost=0.00..68985.69
rows=2399869 width=9)
-> Hash (cost=80849.63..80849.63 rows=2444 width=21)
-> Hash Join (cost=2839.99..80849.63 rows=2444
width=21)
Hash Cond: (public.lineitem.l_partkey =
part.p_partkey)
-> Seq Scan on lineitem
(cost=0.00..68985.69 rows=2399869 width=17)
-> Hash (cost=2839.00..2839.00 rows=79
width=4)
-> Seq Scan on part
(cost=0.00..2839.00 rows=79 width=4)
Filter: ((p_brand =
'Brand#13'::bpchar) AND (p_container = 'JUMBO PKG'::bpchar))

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2011-07-26 19:34:13 Re: WIP: Fast GiST index build
Previous Message Tom Lane 2011-07-26 19:25:05 Re: sinval synchronization considered harmful