Re: Expression Pruning in postgress

From: HarmeekSingh Bedi <harmeeksingh(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Expression Pruning in postgress
Date: 2011-07-07 17:53:08
Message-ID: CALLwk6vj0V959K_F2QQM=TsL7cGw3=tYh-JE7p4eyzDtu56GJg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks Tom. Here is a example. Just a background of things . I have made
changes in postgress execution and storage engine to make it a MPP style
engine - keeping all optimizer intact. Basically take pgress serial plan and
construct a parallel plan. The query I am running is below.

*Query*

# explain select ooj.local_time,ooj.is_timeout,ooj.capsulename,
regexp_split_to_table(
case
when ooj.isubstr is null then 'none'
when ooj.isubstr='' then 'none'
else ooj.isubstr
end ,';') as interest,
sum(ooj.impression) count_impressions,
sum(ooj.click) count_clicks
from (
select (impression.server_utc_time/3600000)*3600 as local_time,
impression.is_timeout_default_impression as is_timeout,
impression.capsule_name as capsulename,
substring(impression.website_variables from 'ybt=([0-9;]*)') as
isubstr,
1 as impression,
case click.impression_id when null then 0 else 1 end as click
from
impression_ytw2_row impression
left outer join clicks_row click
on impression.impression_id = click.impression_id) ooj
group by local_time, is_timeout, capsulename, interest;

*Now if you kindly bear with me and ignore the Parallel nodes in the plan
{which are just parallel distributors} . If you compare the two plans below
and check the output tupledesc of the Hash join you will see that some
columns are not gettign pruned out - In my case this is a big column.
*
*Case 1 Plan {inline view gets merged} *

Now when the inline view gets merged via pullup_subqueries I can see that
we have columns {impression.website_variables} which should get pruned out
but it does not after the JOIN { it gets pruned out after hash aggregate).

Parallel reciever RANDOM queue (nodenum=0 cost=132.79..133.12 rows=10
width=104)
Output: (sum(1)), impression.is_timeout_default_impression,
impression.capsule_name, (sum(1)), sum(1), sum(1)
-> Parallel sender RANDOM queue (nodenum=1 cost=132.79..133.12 rows=10
width=104)
Output: (sum(1)), impression.is_timeout_default_impression,
impression.capsule_name, (sum(1)), sum(1), sum(1)
-> HashAggregate (nodenum=2 cost=132.79..133.12 rows=10
width=104)
Output: (((impression.server_utc_time / 3600000) * 3600)),
impression.is_timeout_default_impression, impression.capsule_name,
(regexp_split_to_table(CASE WHEN
("substring"((impression.website_variables)::text, 'ybt=([0-9;]*)'::text) IS
NULL
THEN 'none'::text WHEN ("substring"((impression.website_variables)::text,
'ybt=([0-9;]*)'::text) = ''::text) THEN 'none'::text ELSE
"substring"((impression.website_variables)::text, 'ybt=([0-9;]*)'::text)
END, ';'::text)), sum(1), sum(1)
-> Parallel reciever HASH-AGG queue (nodenum=3
cost=117.45..130.08 rows=181 width=104)
Output: impression.server_utc_time,
impression.is_timeout_default_impression, impression.capsule_name, *
impression.website_variables*, ((impression.server_utc_time / 3600000) *
3600), regexp_split_to_table(CASE WHEN ("substring"((impres
ion.website_variables)::text, 'ybt=([0-9;]*)'::text) IS NULL) THEN
'none'::text WHEN ("substring"((impression.website_variables)::text,
'ybt=([0-9;]*)'::text) = ''::text) THEN 'none'::text ELSE
"substring"((impression.website_variables)::text, 'ybt=([0-9;
*)'::text) END, ';'::text)
-> Parallel sender HASH-AGG queue (nodenum=4
cost=117.45..130.08 rows=181 width=104)
Output: impression.server_utc_time,
impression.is_timeout_default_impression, impression.capsule_name, *
impression.website_variables*, ((impression.server_utc_time / 3600000) *
3600), regexp_split_to_table(CASE WHEN ("substring"((
mpression.website_variables)::text, 'ybt=([0-9;]*)'::text) IS NULL) THEN
'none'::text WHEN ("substring"((impression.website_variables)::text,
'ybt=([0-9;]*)'::text) = ''::text) THEN 'none'::text ELSE
"substring"((impression.website_variables)::text, 'ybt=
[0-9;]*)'::text) END, ';'::text)
* -> Hash Left Join (nodenum=5
cost=117.45..130.08 rows=181 width=104)
Output: impression.server_utc_time,
impression.is_timeout_default_impression, impression.capsule_name,
impression.website_variables, ((impression.server_utc_time / 3600000) *
3600), regexp_split_to_table(CASE WHEN ("substr
ng"((impression.website_variables)::text, 'ybt=([0-9;]*)'::text) IS NULL)
THEN 'none'::text WHEN ("substring"((impression.website_variables)::text,
'ybt=([0-9;]*)'::text) = ''::text) THEN 'none'::text ELSE
"substring"((impression.website_variables)::text,*
'ybt=([0-9;]*)'::text) END, ';'::text)
-> Parallel reciever HASH-LEFT queue
(nodenum=6 cost=0.00..3.10 rows=10 width=136)
Output: impression.server_utc_time,
impression.is_timeout_default_impression, impression.capsule_name,
impression.website_variables, impression.impression_id
-> Parallel sender HASH-LEFT queue
(nodenum=7 cost=0.00..3.10 rows=10 width=136)
Output:
impression.server_utc_time, impression.is_timeout_default_impression,
impression.capsule_name, impression.website_variables,
impression.impression_id
-> Seq Scan on
impression_ytw2_row impression (nodenum=8 cost=0.00..3.10 rows=10
width=136)
Output:
impression.server_utc_time, impression.is_timeout_default_impression,
impression.capsule_name, impression.website_variables,
impression.impression_id
-> Hash (nodenum=9 cost=72.20..72.20
rows=3620 width=32)
Output: click.impression_id
-> Parallel reciever HASH-RIGHT
queue (nodenum=10 cost=0.00..72.20 rows=3620 width=32)
Output: click.impression_id
-> Parallel sender HASH-RIGHT
queue (nodenum=11 cost=0.00..72.20 rows=3620 width=32)
Output:
click.impression_id
-> Seq Scan on
clicks_row click (nodenum=12 cost=0.00..72.20 rows=3620 width=32)
Output:
click.impression_id

*Case 2 { PLan when I disable code to merge view - basically do not call
pullup_subqueries - it does the right thing }

* Parallel reciever RANDOM queue (nodenum=0 cost=133.70..137.32 rows=181
width=112)
Output: ooj.local_time, ooj.is_timeout, ooj.capsulename, (),
sum((sum((sum(ooj.impression))))), sum((sum((sum(ooj.click)))))
-> Parallel sender RANDOM queue (nodenum=1 cost=133.70..137.32 rows=181
width=112)
Output: ooj.local_time, ooj.is_timeout, ooj.capsulename, (),
sum((sum(ooj.impression))), sum((sum(ooj.click)))
-> HashAggregate (nodenum=2 cost=133.70..137.32 rows=181
width=112)
Output: ooj.local_time, ooj.is_timeout, ooj.capsulename,
(regexp_split_to_table(CASE WHEN (ooj.isubstr IS NULL) THEN 'none'::text
WHEN (ooj.isubstr = ''::text) THEN 'none'::text ELSE ooj.isubstr END,
';'::text)), sum(ooj.impression), sum(oo
.click)
-> Parallel reciever HASH-AGG queue (nodenum=3
cost=117.45..130.98 rows=181 width=112)
Output: ooj.local_time, ooj.is_timeout,
ooj.capsulename, ooj.isubstr, ooj.impression, ooj.click,
regexp_split_to_table(CASE WHEN (ooj.isubstr IS NULL) THEN 'none'::text WHEN
(ooj.isubstr = ''::text) THEN 'none'::text ELSE ooj.isubstr
ND, ';'::text)
-> Parallel sender HASH-AGG queue (nodenum=4
cost=117.45..130.98 rows=181 width=112)
Output: ooj.local_time, ooj.is_timeout,
ooj.capsulename, ooj.isubstr, ooj.impression, ooj.click,
regexp_split_to_table(CASE WHEN (ooj.isubstr IS NULL) THEN 'none'::text WHEN
(ooj.isubstr = ''::text) THEN 'none'::text ELSE ooj.is
bstr END, ';'::text)
-> Subquery Scan ooj (nodenum=5
cost=117.45..130.98 rows=181 width=112)
Output: ooj.local_time, ooj.is_timeout,
ooj.capsulename, ooj.isubstr, ooj.impression, ooj.click,
regexp_split_to_table(CASE WHEN (ooj.isubstr IS NULL) THEN 'none'::text WHEN
(ooj.isubstr = ''::text) THEN 'none'::text ELSE
oj.isubstr END, ';'::text)
*-> Hash Left Join (nodenum=6
cost=117.45..128.27 rows=181 width=104)
Output: ((impression.server_utc_time
/ 3600000) * 3600), impression.is_timeout_default_impression,
impression.capsule_name, "substring"((impression.website_variables)::text,
'ybt=([0-9;]*)'::text), 1, 1*
-> Parallel reciever HASH-LEFT
queue (nodenum=7 cost=0.00..3.10 rows=10 width=136)
Output:
impression.server_utc_time, impression.is_timeout_default_impression,
impression.capsule_name, impression.website_variables,
impression.impression_id
-> Parallel sender HASH-LEFT
queue (nodenum=8 cost=0.00..3.10 rows=10 width=136)
Output:
impression.server_utc_time, impression.is_timeout_default_impression,
impression.capsule_name, impression.website_variables,
impression.impression_id
-> Seq Scan on
impression_ytw2_row impression (nodenum=9 cost=0.00..3.10 rows=10
width=136)
Output:
impression.server_utc_time, impression.is_timeout_default_impression,
impression.capsule_name, impression.website_variables,
impression.impression_id
-> Hash (nodenum=10
cost=72.20..72.20 rows=3620 width=32)
Output: click.impression_id
-> Parallel reciever
HASH-RIGHT queue (nodenum=11 cost=0.00..72.20 rows=3620 width=32)
Output:
click.impression_id
-> Parallel sender
HASH-RIGHT queue (nodenum=12 cost=0.00..72.20 rows=3620 width=32)
Output:
click.impression_id
-> Seq Scan on
clicks_row click (nodenum=13 cost=0.00..72.20 rows=3620 width=32)
Output:
click.impression_id

*More analysis

*

1. Looked at code in make_join_rel and build_joinrel_tlist I can see in
one case the bms_nonempty_difference code kicks in and finds that the column
can be pruned out and other case it does not do the right thing.
2. I am still trying to work out why pullup_subquery should make a
difference.

Any pointers appreciated .

Regards
Harmeek

2011 at 7:24 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> HarmeekSingh Bedi <harmeeksingh(at)gmail(dot)com> writes:
>> In a plan where Node 1 is parent {say join) and Node 2 is child
>> (say scan) . If node 1 has a expression say foo(column) then scan
>> will project 'column' for sure and join will
>> evaluate foo(column). Now if the node above join does not need
>> column ? how does postgress remove the column from join's projection
>> list .
>
> See build_joinrel_tlist() in relnode.c.
>
>> I am seeing that it does not in many
>> cases specially when sort appears above.
>
> Please show a concrete example.
>
> regards, tom lane
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2011-07-07 18:11:19 Re: [RRR] 9.2 CF2: 20 days in
Previous Message Noah Misch 2011-07-07 17:42:24 Re: [v9.2] Fix leaky-view problem, part 1