Re: NOT IN subquery optimization

From: Richard Guo <riguo(at)pivotal(dot)io>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: "Li, Zheng" <zhelli(at)amazon(dot)com>, "Finnerty, Jim" <jfinnert(at)amazon(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: NOT IN subquery optimization
Date: 2019-02-26 10:24:36
Message-ID: CAN_9JTwZUsWLymsGA+BAJiGyqL5-gpEhu5zRF3gG9grCgmoCWA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Greenplum Database does this optimization. The idea is to use a new join
type, let's call it JOIN_LASJ_NOTIN, and its semantic regarding NULL is
defined as below:

1. If there is a NULL in the outer side, and the inner side is empty, the
NULL should be part of the outputs.

2. If there is a NULL in the outer side, and the inner side is not empty,
the NULL should not be part of the outputs.

3. If there is a NULL in the inner side, no outputs should be produced.

An example plan looks like:

gpadmin=# explain (costs off) select * from t1 where a not in(select a
from t2);
QUERY PLAN
-----------------------------------
Hash Left Anti Semi (Not-In) Join
Hash Cond: (t1.a = t2.a)
-> Seq Scan on t1
-> Hash
-> Seq Scan on t2
(5 rows)

Thanks
Richard

On Tue, Feb 26, 2019 at 7:20 AM David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
wrote:

> On Tue, 26 Feb 2019 at 11:51, Li, Zheng <zhelli(at)amazon(dot)com> wrote:
> > Resend the patch with a whitespace removed so that "git apply patch"
> works directly.
>
> I had a quick look at this and it seems to be broken for the empty
> table case I mentioned up thread.
>
> Quick example:
>
> Setup:
>
> create table t1 (a int);
> create table t2 (a int not null);
> insert into t1 values(NULL),(1),(2);
>
> select * from t1 where a not in(select a from t2);
>
> Patched:
> a
> ---
> 1
> 2
> (2 rows)
>
> Master:
> a
> ---
>
> 1
> 2
> (3 rows)
>
> This will be due to the fact you're adding an a IS NOT NULL qual to
> the scan of a:
>
> postgres=# explain select * from t1 where a not in(select a from t2);
> QUERY PLAN
> ------------------------------------------------------------------
> Hash Anti Join (cost=67.38..152.18 rows=1268 width=4)
> Hash Cond: (t1.a = t2.a)
> -> Seq Scan on t1 (cost=0.00..35.50 rows=2537 width=4)
> Filter: (a IS NOT NULL)
> -> Hash (cost=35.50..35.50 rows=2550 width=4)
> -> Seq Scan on t2 (cost=0.00..35.50 rows=2550 width=4)
> (6 rows)
>
> but as I mentioned, you can't do that as t2 might be empty and there's
> no way to know that during planning.
>
> --
> David Rowley
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.2ndQuadrant.com_&d=DwIBaQ&c=lnl9vOaLMzsy2niBC8-h_K-7QJuNJEsFrzdndhuJ3Sw&r=5r3cnfZPUDOHrMiXq8Mq2g&m=dE1nglE17x3nD-oH_BrF0r4SLaFnQKzwwJBJGpDoaaA&s=dshupMomMvkDAd92918cU21AJ1E1s7QwbrxIGSRxZA8&e=
> PostgreSQL Development, 24x7 Support, Training & Services
>
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Dmitry Dolgov 2019-02-26 10:47:13 Re: Segfault when restoring -Fd dump on current HEAD
Previous Message Masahiko Sawada 2019-02-26 10:20:00 Re: [HACKERS] Block level parallel vacuum