Skip site navigation (1) Skip section navigation (2)

Re: IN or EXISTS

From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Andy Colson <andy(at)squeakycode(dot)net>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: IN or EXISTS
Date: 2011-08-31 01:33:28
Message-ID: 4E5D8F68.4010102@ringerc.id.au (view raw or flat)
Thread:
Lists: pgsql-performance
On 31/08/2011 4:30 AM, Andy Colson wrote:
> Hi all,
>
> I have read things someplace saying not exists was better than not 
> in... or something like that.  Not sure if that was for in/exists and 
> not in/not exists, and for a lot of records or not.
>
`EXISTS' may perform faster than `IN', yes. Using `IN' it is necessary 
to build a list of values then iterate over them to check for a match. 
By contrast, `EXISTS' may use a simple index lookup or the like to test 
for the presence of a value.

On the other hand, the `IN' subquery is uncorrelated needs only run 
once, where the `EXISTS' subquery is correlated and has to run once for 
every outer record. That means that the `IN' list approach can be a lot 
faster where the subquery in question is relatively time consuming for 
the number of values it returns. For example, if the `IN' query returns 
only 5 values and takes 100ms, you're scanning 1 million records in the 
outer query, and the subquery `EXISTS' version would take 50ms, using 
`IN' is a no-brainer since 1 million times 50ms will be a lot slower 
than 1 times 100ms plus the time required to scan 5 elements 1 million 
times.

Another complication is the possible presence of NULL in an IN list. 
Getting NULLs in `IN' lists is a common source of questions on this 
list, because people are quite surprised by how it works. EXISTS avoids 
the NULL handling issue (and in the process demonstrates how woefully 
inconsistent SQL's handling of NULL really is).

Theoretically the query planner could transform:

SELECT * from y WHERE y.id IN (SELECT DISTINCT z.y_id FROM z WHERE 
z.y_id IS NOT NULL);

into:

SELECT * FROM y WHERE EXISTS (SELECT 1 FROM z WHERE z.y_id = y.id)

... or vice versa depending on which it thought would be faster. AFAIK 
it doesn't currently do this. To be able to do it the planner would need 
to know how to estimate the cost of scanning an `IN' result list. It'd 
also need to be able to use constraints on the target table to prove 
that the result of the `IN' may not contain nulls. To transform the 
EXISTS version into the IN version where it'd be more efficient, it'd 
also have to be able to use constraints on the target table to prove 
that results of a SELECT would be unique without explicit deduplication.

All this makes me wonder ... does Pg currently support sorting IN lists 
and using a binary search? It'd be pretty nice to be able to prove that:

SELECT * from y WHERE y.id IN (SELECT z.y_id FROM z);

is equvalent to:

SELECT * FROM y WHERE y.id IN (SELECT DISTINCT z.y_id FROM z WHERE z_id 
IS NOT NULL)

... and either transform it to an EXISTS test or add an ORDER BY z_id 
and flag the resultset as sorted so a binary search could be done on it 
whenever a row hits the IN test.

--
Craig Ringer

In response to

Responses

pgsql-performance by date

Next:From: Mark KirkwoodDate: 2011-08-31 05:11:56
Subject: Re: 8.4 optimization regression?
Previous:From: Andy ColsonDate: 2011-08-30 20:30:23
Subject: IN or EXISTS

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group