| From: | David Geier <geidav(dot)pg(at)gmail(dot)com> |
|---|---|
| To: | Richard Guo <guofenglinux(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Convert NOT IN sublinks to anti-joins when safe |
| Date: | 2026-02-04 14:59:22 |
| Message-ID: | 2df5912f-8888-4f27-9384-4c69ba22105c@gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi!
On 04.02.2026 10:47, Richard Guo wrote:
> On Tue, Feb 3, 2026 at 4:12 PM Richard Guo <guofenglinux(at)gmail(dot)com> wrote:
>> This topic has been discussed several times in the past. Due to the
>> semantic mismatch regarding NULL handling, NOT IN is not ordinarily
>> equivalent to an anti-join. However, if we can prove that neither the
>> outer expressions nor the subquery outputs can yield NULL values, it
>> should be safe to convert NOT IN to an anti-join.
I used to play around with query rewrites of the same form, turning NOT
IN into NOT EXISTS. As you rightfully pointed out, the straight forward
rewrite only works if neither the outer expression nor the sub-query
cannot yield NULLs, limiting the optimization considerably. Both cases
can be fixed by amending the basic form of the rewrite.
Basic form:
SELECT t1.c1 FROM t1 WHERE t1.c1 NOT IN (SELECT t2.c1 FROM t2)
=>
SELECT t1.c1 FROM t1 WHERE
NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1)
If the sub-select can yield NULLs, the rewrite can be fixed by adding an
OR t2.c1 IS NULL clause, such as:
SELECT t1.c1 FROM t1 WHERE
NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1 OR t2.c1 IS NULL)
which is equivalent to the following SQL which avoids the OR:
SELECT t1.c1 FROM t1 WHERE
NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1) AND
NOT EXISTS (SELECT 1 FROM t2 WHERE t2.c1 IS NULL)
If the outer expression can yield NULLs, the rewrite can be fixed by
adding a t1.c1 IS NOT NULL clause, such as:
SELECT t1.c1 FROM T1 WHERE
t1.c1 IS NOT NULL AND
NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1)
Both fixes can of course be combined yielding:
SELECT t1.c1 FROM T1 WHERE
t1.c1 IS NOT NULL AND
NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1) AND
NOT EXISTS (SELECT 1 FROM t2 WHERE t2.c1 IS NULL)
What's our today's take on doing more involved transformations inside
the planner to support such cases? It would greatly open up the scope of
the optimization.
> I've noticed a loose end in the v1 patch.
>
> The semantic gap between NOT IN and anti-join actually exists whenever
> the operator returns NULL. For NOT IN, if (A op B) returns NULL, then
> NOT (NULL) evaluates to NULL (effectively false), and the row is
> discarded. In contrast, for an anti-join, if (A op B) returns NULL,
> it implies no match was found, and the anti-join logic dictates that
> the row should be kept.
>
> To guarantee that (A op B) never returns NULL, the current patch
> verifies that both A and B are non-nullable. However, this is not
> sufficient. The "op" might be an operator that returns NULL on
> non-null inputs.
>
> On the other hand, if "op" does not return NULL on NULL inputs, like
> IS DISTINCT FROM, we technically would not even need to require that A
> and B are non-nullable.
>
> Is there a convenient way to verify that an operator never returns
> NULL on non-null inputs? Would it be sufficient to insist that the
> operator belongs to btree opclass (assuming that the strict ordering
> requirements of btree imply this safety)?
There's lots of code that e.g. calls FunctionCall2[Coll]() on operators,
where that function raises an ERROR in case the returned value is NULL.
Hence, I would say it's safe to make that assumption.
Otherwise, you could restrict the check to only builtin operators. For
those we know they have the expected semantics.
--
David Geier
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Peter Eisentraut | 2026-02-04 15:06:12 | Re: Docs: Use non-default throughout the documentation |
| Previous Message | Peter Eisentraut | 2026-02-04 14:57:49 | Re: Show expression of virtual columns in error messages |