Re: Pull up sublink of type 'NOT NOT (expr)'

From: Richard Guo <riguo(at)pivotal(dot)io>
To: Alexey Bashtanov <bashtanov(at)imap(dot)cc>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Pull up sublink of type 'NOT NOT (expr)'
Date: 2018-10-25 07:42:05
Message-ID: CAN_9JTyyi0Q=Mis5T2KL0QCSAhBJ0aGqnZm2hYjQ0E25bnTU=Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Alex,

Yes hashed SubPlan preserves order and may be faster than hash join in some
cases. But I don't think that is a reason good enough to prevent the subplan
from being converted to join.

Let's suppose the subplan is uncorrelated, otherwise hashed SubPlan would
not
be used. Hashed SubPlan can only be applied when the size of the subquery
result can fit in work_mem. When the data size increases or work_mem is set
to
a smaller value that more than one batch is needed, hashed SubPlan will not
be
used and the total cost will increase dramatically. Hash join, by contrast,
handles multiple batches better.

Below is an example to show the behavior of hash join and hashed SubPlan for
single batch and multiple batches.

create table a (i int);
create table b (i int);
insert into a select i from generate_series(1,10000)i;
insert into b select i from generate_series(1,10000)i;

*For single batch:*

*Hash Join*
gpadmin=# explain analyze select * from a where a.i in (select b.i from b);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Hash Semi Join (cost=270.00..552.50 rows=10000 width=4) (actual
time=4.332..9.499 rows=10000 loops=1)
Hash Cond: (a.i = b.i)
-> Seq Scan on a (cost=0.00..145.00 rows=10000 width=4) (actual
time=0.016..1.328 rows=10000 loops=1)
-> Hash (cost=145.00..145.00 rows=10000 width=4) (actual
time=4.292..4.292 rows=10000 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 480kB
-> Seq Scan on b (cost=0.00..145.00 rows=10000 width=4) (actual
time=0.013..1.817 rows=10000 loops=1)
Planning Time: 0.236 ms
Execution Time: 10.099 ms
(8 rows)

*Hashed SubPlan*
gpadmin=# explain analyze select * from a where not not a.i in (select b.i
from b);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Seq Scan on a (cost=170.00..340.00 rows=5000 width=4) (actual
time=6.359..11.843 rows=10000 loops=1)
Filter: (hashed SubPlan 1)
SubPlan 1
-> Seq Scan on b (cost=0.00..145.00 rows=10000 width=4) (actual
time=0.017..1.749 rows=10000 loops=1)
Planning Time: 0.109 ms
Execution Time: 12.905 ms
(6 rows)

insert into b select i from generate_series(1,100000)i;
insert into b select i from generate_series(1,100000)i;

*For multiple batches:*

*Hash Join*
gpadmin=# explain analyze select * from a where a.i in (select b.i from b);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Hash Semi Join (cost=6476.00..7659.50 rows=10000 width=4) (actual
time=89.915..121.016 rows=10000 loops=1)
Hash Cond: (a.i = b.i)
-> Seq Scan on a (cost=0.00..145.00 rows=10000 width=4) (actual
time=0.020..1.779 rows=10000 loops=1)
-> Hash (cost=3030.00..3030.00 rows=210000 width=4) (actual
time=89.666..89.667 rows=210000 loops=1)
Buckets: 131072 Batches: 4 Memory Usage: 2860kB
-> Seq Scan on b (cost=0.00..3030.00 rows=210000 width=4)
(actual time=0.015..37.686 rows=210000 loops=1)
Planning Time: 0.256 ms
Execution Time: 121.769 ms
(8 rows)

*Plain SubPlan*
gpadmin=# explain analyze select * from a where not not a.i in (select b.i
from b);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Seq Scan on a (cost=0.00..27130170.00 rows=5000 width=4) (actual
time=0.030..9994.036 rows=10000 loops=1)
Filter: (SubPlan 1)
SubPlan 1
-> Materialize (cost=0.00..4901.00 rows=210000 width=4) (actual
time=0.000..0.359 rows=5000 loops=10000)
-> Seq Scan on b (cost=0.00..3030.00 rows=210000 width=4)
(actual time=0.010..1.673 rows=10000 loops=1)
Planning Time: 0.077 ms
Execution Time: 9995.556 ms
(7 rows)

Thanks
Richard

On Wed, Oct 24, 2018 at 12:39 AM, Alexey Bashtanov <bashtanov(at)imap(dot)cc>
wrote:

> Hello Richard,
>
> Currently for quals in the form of "NOT NOT (SubLink)", this SubLink would
> not
> be considered when pulling up sublinks. For instance:
>
> gpadmin=# explain select * from a where NOT NOT (a.i in (select b.i from
> b));
> QUERY PLAN
> -------------------------------------------------------------
> Seq Scan on a (cost=51.50..85.62 rows=1005 width=8)
> Filter: (hashed SubPlan 1)
> SubPlan 1
> -> Seq Scan on b (cost=0.00..44.00 rows=3000 width=4)
> (4 rows)
>
>
> Should we give it a chance, like the attached does?
>
>
> Sometimes hashed subplan is faster than hash join and than all the other
> options, as it preserves the order.
> Using NOT NOT, one can control whether to use it or not:
> https://pgblog.bashtanov.com/2017/12/08/double-negative-
> and-query-performance/ (test case and results in the bottom of the page).
>
> Surely dirty tricks should not be the way to control the planner, but when
> breaking them we should probably provide a way to achieve the same result,
> ideally making the planner choose the best plan without hints.
>
> Best,
> Alex
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Shay Rojansky 2018-10-25 07:53:14 UNLISTEN, DISCARD ALL and readonly standby
Previous Message Peter Eisentraut 2018-10-25 07:41:13 Re: Alter index rename concurrently to