Re: BUG #4949: NOT IN is prohibitive slower than the rewrite for medium to large sets

From: Tom Molesworth <tom(at)audioboundary(dot)com>
To: Ole Tange <ole(at)tange(dot)dk>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #4949: NOT IN is prohibitive slower than the rewrite for medium to large sets
Date: 2009-07-28 22:16:37
Message-ID: 4A6F78C5.9070607@audioboundary.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Ole Tange wrote:
> On Tue, Jul 28, 2009 at 3:47 PM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> "Ole Tange" <postgresql(dot)org(at)tange(dot)dk> writes:
>>
>>> (modulo NULLs which seem to always cause problems in NOT INs).
>>>
>>> Because it can be rewritten, NOT IN should never be much slower than the
>>> rewritten solution, as PostgreSQL should simply rewrite NOT IN to the
>>> above.
>>>
>> Let's see, you understand that the rewrite violates the SQL standard
>> semantics of NOT IN, but you think we should do it anyway?
>>
>
> Thanks for your kind reply.
>
> Apparently my bug report was not quite clear. My rewrite example was
> simply to show a way that could do a marvelous speedup on medium to
> large sets. The correct dealing with NULL I am sure can be handled
> just as efficiently.
>

Just an observation - it seems that you're using NOT IN() and expecting
it to do the same as this:

SELECT foo FROM a LEFT JOIN b ON a.key = b.key WHERE b.key IS NULL

I find it's comparatively rare to actually want the NOT IN()
null-handling semantics, especially given your comment about nulls
'causing problems' (although I guess you could get the same behaviour in
that query as NOT IN, by adding 'AND a.key IS NOT NULL' to the where
clause - I have no idea whether this is something the optimiser could or
should be able to do).

Although it refers to TSQL, this page might help explain more about NOT IN:

http://stackoverflow.com/questions/129077/sql-not-in-constraint-and-null-values

Not sure if constructions such as 'not exists (select 1 from b where
b.key = a.key)' also have the same performance issues you describe -
should be faster due to the key? - but I find the left join approach
works well enough up to several million rows in the two tables. An
unindexed subquery like 'select key from a' just seems like a guaranteed
way to invoking a full (slow) table scan.

Tom

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Jim Michaels 2009-07-28 22:57:50 BUG #4951: installation dir wrong for libpq compilation
Previous Message Ole Tange 2009-07-28 20:57:10 Re: BUG #4949: NOT IN is prohibitive slower than the rewrite for medium to large sets