Re: NOT IN subquery optimization

From: "Li, Zheng" <zhelli(at)amazon(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Richard Guo <riguo(at)pivotal(dot)io>, "Finnerty, Jim" <jfinnert(at)amazon(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: NOT IN subquery optimization
Date: 2019-03-01 23:39:05
Message-ID: DBD4724F-5466-40FB-AF28-4CEC9D4B6D1F@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

The current transformation would not add "or s.a is NULL" in the example provided since it is non-nullable. You will be comparing these two cases in terms of the transformation:

select count(*) from big b where not exists(select 1 from small s
where s.a = b.a);
Time: 51.416 ms

select count(*) from big b where a not in (select a from s);
Time: 890.088 ms

But if s.a is nullable, yes, you have proved my previous statement is false... I should have used almost always.
However, if s.a is nullable, we would do this transformation:
select count(*) from big b where not exists(select 1 from small s
where s.a = b.a or s.a is null);

It's possible to stop the nested loop join early during execution once we find an inner Null entry because every outer tuple is
going to evaluate to true on the join condition.

Zheng

On 3/1/19, 6:17 PM, "David Rowley" <david(dot)rowley(at)2ndquadrant(dot)com> wrote:

On Sat, 2 Mar 2019 at 12:13, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> "Li, Zheng" <zhelli(at)amazon(dot)com> writes:
> > Although adding "or var is NULL" to the anti join condition forces the planner to choose nested loop anti join, it is always faster compared to the original plan.
>
> TBH, I am *really* skeptical of sweeping claims like that. The existing
> code will typically produce a hashed-subplan plan, which ought not be
> that awful as long as the subquery result doesn't blow out memory.
> It certainly is going to beat a naive nested loop.

It's pretty easy to show the claim is false using master and NOT EXISTS.

create table small(a int not null);
create table big (a int not null);
insert into small select generate_Series(1,1000);
insert into big select x%1000+1 from generate_Series(1,1000000) x;

select count(*) from big b where not exists(select 1 from small s
where s.a = b.a);
Time: 178.575 ms

select count(*) from big b where not exists(select 1 from small s
where s.a = b.a or s.a is null);
Time: 38049.969 ms (00:38.050)


--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2019-03-01 23:46:43 Re: Infinity vs Error for division by zero
Previous Message Chapman Flack 2019-03-01 23:38:39 Re: Infinity vs Error for division by zero