| From: | wenhui qiu <qiuwenhuifx(at)gmail(dot)com> |
|---|---|
| To: | Richard Guo <guofenglinux(at)gmail(dot)com> |
| Cc: | Pg Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Convert NOT IN sublinks to anti-joins when safe |
| Date: | 2026-02-04 03:53:08 |
| Message-ID: | CAGjGUAJh0bBaETV7Zin9J5pY++9ZEgXnPy--M66LUj3-nd63AQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi Richard
That's just brilliant! Here’s my testing process, which also tackles a
commonly criticized aspect of PostgreSQL, effectively addressing its
shortcomings.Moreover, the path test scenarios are thorough, encompassing
cases that involve "NOT IN".Once again, thank you for your hard work.
###########
create table join1 (id integer,name varchar(300),k1 integer);
create table join2 (id integer,name varchar(300),score integer);
insert into join1 values (
generate_series(1,20000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values (
generate_series(1,20000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values (
generate_series(1,20000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values (
generate_series(50201,50300),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAASSSSSAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values (
generate_series(50201,50300),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAASSSSSAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values (
generate_series(150201,1350300),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAASSSSSAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join2 values (
generate_series(1,40000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',1);
insert into join2 values (
generate_series(1,40000),'aaaaaaaaaaaaaaaaAAAAAAABBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',2);
insert into join2 values (
generate_series(20001,22000),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
insert into join2 values (
generate_series(150201,950300),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
create index idx_j1 on join1(id);
create index idx_j2 on join2(id);
VACUUM ANALYZE JOIN1;
VACUUM ANALYZE JOIN2;
postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN
(SELECT T2.id FROM join2 t2 WHERE T2.ID>10000);
QUERY PLAN
---------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=20060.65..58729.40 rows=630150 width=8)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..17904.72
rows=862371 width=4)
Index Cond: (id > 10000)
(5 rows)
postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN
(SELECT T2.id FROM join2 t2 WHERE T2.ID>10000 and t2.id is not null);
QUERY PLAN
---------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=22216.57..60885.32 rows=630150 width=8)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..20060.65
rows=862371 width=4)
Index Cond: ((id > 10000) AND (id IS NOT NULL))
(5 rows)
postgres=# alter table join1 alter id set not null;
ALTER TABLE
postgres=# alter table join2 alter id set not null;
ALTER TABLE
postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN
(SELECT T2.id FROM join2 t2 WHERE T2.ID>10000 );
QUERY PLAN
-------------------------------------------------------------------------------------------------
Hash Anti Join (cost=28684.35..72424.36 rows=393902 width=8)
Hash Cond: (t1.id = t2.id)
-> Seq Scan on join1 t1 (cost=0.00..35518.00 rows=1260300 width=8)
-> Hash (cost=17904.72..17904.72 rows=862371 width=4)
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..17904.72
rows=862371 width=4)
Index Cond: (id > 10000)
(6 rows)
postgres=#
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id
NOT IN (SELECT T2.id FROM join2 t2 where t2.id is not null);
QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Hash Anti Join (cost=27134.57..70874.58 rows=393902 width=8) (actual
time=239.777..602.433 rows=400200.00 loops=1)
Hash Cond: (t1.id = t2.id)
Buffers: shared hit=25304
-> Seq Scan on join1 t1 (cost=0.00..35518.00 rows=1260300 width=8)
(actual time=0.007..85.255 rows=1260300.00 loops=1)
Buffers: shared hit=22915
-> Hash (cost=16108.32..16108.32 rows=882100 width=4) (actual
time=220.949..220.951 rows=882100.00 loops=1)
Buckets: 1048576 Batches: 1 Memory Usage: 39204kB
Buffers: shared hit=2389
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..16108.32
rows=882100 width=4) (actual time=0.008..103.306 rows=882100.00 loops=1)
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=2389
Planning:
Buffers: shared hit=12
Planning Time: 0.193 ms
Execution Time: 617.668 ms
(16 rows)
postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN
(SELECT T2.id FROM join2 t2 );
QUERY PLAN
-------------------------------------------------------------------------------------------------
Hash Anti Join (cost=27134.57..70874.58 rows=393902 width=8)
Hash Cond: (t1.id = t2.id)
-> Seq Scan on join1 t1 (cost=0.00..35518.00 rows=1260300 width=8)
-> Hash (cost=16108.32..16108.32 rows=882100 width=4)
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..16108.32
rows=882100 width=4)
(5 rows)
postgres=# alter table join2 alter id drop not null ;
ALTER TABLE
postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN
(SELECT T2.id FROM join2 t2 );
QUERY PLAN
---------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=18313.57..56982.32 rows=630150 width=8)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..16108.32
rows=882100 width=4)
(4 rows)
postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN
(SELECT T2.id FROM join2 t2 where t2.id is not null);
QUERY PLAN
-------------------------------------------------------------------------------------------------
Hash Anti Join (cost=29339.83..73079.83 rows=393902 width=8)
Hash Cond: (t1.id = t2.id)
-> Seq Scan on join1 t1 (cost=0.00..35518.00 rows=1260300 width=8)
-> Hash (cost=18313.58..18313.58 rows=882100 width=4)
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..18313.58
rows=882100 width=4)
Index Cond: (id IS NOT NULL)
(6 rows)
postgres=# insert into join2 values (
generate_series(950300,1950300),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
INSERT 0 1000001
postgres=# analyze join2;
ANALYZE
postgres=#
postgres=# insert into join2 values (
generate_series(1950300,3950300),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
INSERT 0 2000001
postgres=# analyze join2;
ANALYZE
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id
NOT IN (SELECT T2.id FROM join2 t2 where t2.id is not null);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.86..73003.87 rows=1 width=8) (actual
time=348.632..371.158 rows=200.00 loops=1)
Workers Planned: 3
Workers Launched: 3
Buffers: shared hit=99166 read=3398
-> Merge Anti Join (cost=0.86..72003.77 rows=1 width=8) (actual
time=177.574..334.714 rows=50.00 loops=4)
Merge Cond: (t1.id = t2.id)
Buffers: shared hit=99166 read=3398
-> Parallel Index Scan using idx_j1 on join1 t1
(cost=0.43..37380.06 rows=406548 width=8) (actual time=0.090..81.338
rows=315075.00 loops=4)
Index Searches: 1
Buffers: shared hit=85241 read=3398
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.43..80682.41
rows=3882102 width=4) (actual time=0.049..145.187 rows=1281766.50 loops=4)
Index Cond: (id IS NOT NULL)
Heap Fetches: 0
Index Searches: 4
Buffers: shared hit=13925
Planning:
Buffers: shared hit=12
Planning Time: 0.191 ms
Execution Time: 371.191 ms
postgres=# set work_mem ='128MB';
SET
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id
NOT IN (SELECT T2.id FROM join2 t2 );
QUERY
PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=80682.42..119351.17 rows=630150 width=8)
(actual time=1477.040..1835.205 rows=200.00 loops=1)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
Rows Removed by Filter: 1260100
Buffers: shared hit=33502
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.43..70977.16
rows=3882102 width=4) (actual time=0.028..421.350 rows=3882102.00 loops=1)
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=10587
Planning Time: 0.083 ms
Execution Time: 1845.041 ms
(11 rows)
postgres=# set work_mem ='8MB';
SET
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id
NOT IN (SELECT T2.id FROM join2 t2 );
QUERY
PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=80682.42..119351.17 rows=630150 width=8)
(actual time=1468.254..1825.373 rows=200.00 loops=1)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
Rows Removed by Filter: 1260100
Buffers: shared hit=33502
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.43..70977.16
rows=3882102 width=4) (actual time=0.025..425.249 rows=3882102.00 loops=1)
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=10587
Planning Time: 0.085 ms
Execution Time: 1835.148 ms
(11 rows)
postgres=# set work_mem ='128MB';
SET
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id
NOT IN (SELECT T2.id FROM join2 t2 );
QUERY
PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=80682.42..119351.17 rows=630150 width=8)
(actual time=1477.040..1835.205 rows=200.00 loops=1)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
Rows Removed by Filter: 1260100
Buffers: shared hit=33502
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.43..70977.16
rows=3882102 width=4) (actual time=0.028..421.350 rows=3882102.00 loops=1)
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=10587
Planning Time: 0.083 ms
Execution Time: 1845.041 ms
(11 rows)
postgres=# set work_mem ='8MB';
SET
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id
NOT IN (SELECT T2.id FROM join2 t2 );
QUERY
PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=80682.42..119351.17 rows=630150 width=8)
(actual time=1468.254..1825.373 rows=200.00 loops=1)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
Rows Removed by Filter: 1260100
Buffers: shared hit=33502
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.43..70977.16
rows=3882102 width=4) (actual time=0.025..425.249 rows=3882102.00 loops=1)
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=10587
Planning Time: 0.085 ms
Execution Time: 1835.148 ms
(11 rows)
postgres=# set work_mem ='2MB';
SET
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id
NOT IN (SELECT T2.id FROM join2 t2 );
^CCancel request sent
ERROR: canceling statement due to user request
postgres=# ^C
postgres=#
postgres=# set work_mem ='2MB';
SET
postgres=# select now();
now
-------------------------------
2026-02-04 11:31:42.957009+08
(1 row)
postgres=# select now();
now
-------------------------------
2026-02-04 11:32:10.811249+08
(1 row)
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id
NOT IN (SELECT T2.id FROM join2 t2 );
^CCancel request sent
ERROR: canceling statement due to user request
postgres=# select now();
now
-------------------------------
2026-02-04 11:42:29.698854+08
(1 row)
postgres=#
Thanks
On Tue, Feb 3, 2026 at 5:41 PM wenhui qiu <qiuwenhuifx(at)gmail(dot)com> wrote:
> Hi Richard
>
> > I believe we are now in a much better position to attempt this again.
> > The planner has accumulated significant infrastructure that makes this
> > proof straightforward and reliable. Specifically, we can now leverage
> > the outer-join-aware-Var infrastructure to tell whether a Var comes
> > from the nullable side of an outer join, and the not-null-attnums hash
> > table to efficiently check whether a Var is defined NOT NULL. We also
> > have the expr_is_nonnullable() function that is smart enough to deduce
> > non-nullability for expressions more complex than simple Vars/Consts.
> Thank you for working on this.Indeed, the benefits are substantial
> and highly necessary, as Oracle, SQL Server, and MySQL have all implemented
> varying degrees of support.I shall test this path in my spare time.
>
> Thanks ,
>
> On Tue, Feb 3, 2026 at 3:13 PM Richard Guo <guofenglinux(at)gmail(dot)com> wrote:
>
>> This topic has been discussed several times in the past. Due to the
>> semantic mismatch regarding NULL handling, NOT IN is not ordinarily
>> equivalent to an anti-join. However, if we can prove that neither the
>> outer expressions nor the subquery outputs can yield NULL values, it
>> should be safe to convert NOT IN to an anti-join.
>>
>> I believe we are now in a much better position to attempt this again.
>> The planner has accumulated significant infrastructure that makes this
>> proof straightforward and reliable. Specifically, we can now leverage
>> the outer-join-aware-Var infrastructure to tell whether a Var comes
>> from the nullable side of an outer join, and the not-null-attnums hash
>> table to efficiently check whether a Var is defined NOT NULL. We also
>> have the expr_is_nonnullable() function that is smart enough to deduce
>> non-nullability for expressions more complex than simple Vars/Consts.
>>
>> Attached is a draft patch for this attempt (part of the code is
>> adapted from an old patch [1] by David and Tom). This patch aims for
>> a conservative implementation: the goal is not to handle every
>> theoretical case, but to handle canonical query patterns with minimal
>> code complexity.
>>
>> The patch primarily targets patterns like:
>>
>> SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM banned_users);
>>
>> ... and
>>
>> SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM
>> banned_users WHERE user_id IS NOT NULL);
>>
>> This is a very typical syntax for exclusion. In well-modeled
>> databases, join keys like id and user_id are very likely to be defined
>> as NOT NULL.
>>
>> It seems to me that the ROI here is quite positive: the added code
>> complexity is very low (thanks to the existing infrastructure), while
>> the benefit is that users writing this typical pattern will finally
>> get efficient anti-join plans without needing manual rewrites.
>>
>> (For the outer expressions, we could potentially also use outer query
>> quals to prove non-nullability. This patch does not attempt to do so.
>> Implementing this would require passing state down during the
>> pull_up_sublinks recursion; and given that find_nonnullable_vars can
>> fail to prove non-nullability in many cases due to the lack of
>> const-simplification at this stage, I'm not sure whether it is worth
>> the code complexity. Besides, I haven't fully convinced myself that
>> doing this does not introduce correctness issues.)
>>
>> Any thoughts?
>>
>> [1] https://postgr.es/m/13766.1405037879@sss.pgh.pa.us
>>
>> - Richard
>>
>
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Corey Huinker | 2026-02-04 04:05:58 | Re: Add expressions to pg_restore_extended_stats() |
| Previous Message | Zane Duffield | 2026-02-04 03:40:04 | Re: Proposal: pg_createsubscriber use without superuser privileges |