Re: NOT IN subquery optimization

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: "Li, Zheng" <zhelli(at)amazon(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Richard Guo <riguo(at)pivotal(dot)io>, "Finnerty, Jim" <jfinnert(at)amazon(dot)com>
Subject: Re: NOT IN subquery optimization
Date: 2019-03-02 16:25:39
Message-ID: 18203.1551543939@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> writes:
> On Sat, 2 Mar 2019 at 13:45, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Yeah --- that has a nontrivial risk of making things significantly worse,
>> which makes it a hard sell. I think the most reasonable bet here is
>> simply to not perform the transformation if we can't prove the inner side
>> NOT NULL. That's going to catch most of the useful cases anyway IMO.

> Did you mean outer side NOT NULL?

Sorry, sloppy thinking.

> Of course, the inner side needs to not produce NULLs either, but
> that's due to the fact that if a NULL exists in the inner side then
> the anti-join should not produce any records.

Right. So the path of least resistance is to transform to antijoin
only if we can prove both of those things (and maybe we need to check
that the join operator is strict, too? -ENOCAFFEINE). The question
before us is what is the cost-benefit ratio of trying to cope with
additional cases. I'm skeptical that it's attractive: the cost
certainly seems high, and I don't know that there are very many
real-world cases where we'd get a win.

Hmm ... thinking about the strictness angle some more: what we really
need to optimize NOT IN, IIUC, is an assumption that the join operator
will never return NULL. While not having NULL inputs is certainly a
*necessary* condition for that (assuming a strict operator) it's not a
*sufficient* condition. Any Postgres function/operator is capable
of returning NULL whenever it feels like it. So checking strictness
does not lead to a mathematically correct optimization.

My initial thought about plugging that admittedly-academic point is
to insist that the join operator be both strict and a member of a
btree opclass (hash might be OK too; less sure about other index types).
The system already contains assumptions that btree comparators never
return NULL. I doubt that this costs us any real-world applicability,
because if the join operator can neither merge nor hash, we're screwed
anyway for finding a join plan that's better than nested-loop.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2019-03-02 16:40:09 Re: GSoC 2019
Previous Message Stephen Frost 2019-03-02 16:08:16 Re: Online verification of checksums