Skip site navigation (1) Skip section navigation (2)

Re: Subquery WHERE IN or WHERE EXISTS faster?

From: Rusty Conover <rconover(at)infogears(dot)com>
To: Ulrich <ulrich(dot)mierendorff(at)gmx(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Subquery WHERE IN or WHERE EXISTS faster?
Date: 2008-06-30 06:50:27
Message-ID: B7AEEA21-79C7-481D-BB7B-899002CC2174@infogears.com (view raw or flat)
Thread:
Lists: pgsql-performance

On Jun 28, 2008, at 4:07 PM, Ulrich wrote:

> Hi,
> I have added a bit of dummy Data, 100000 processors, 10000 users,  
> each user got around 12 processors.
>
> I have tested both queries. First of all, I was surprised that it is  
> that fast :) Here are the results:
>
>
> EXPLAIN ANALYZE SELECT speed FROM processors WHERE id IN (SELECT  
> processorid FROM users_processors WHERE userid=4040) ORDER BY speed  
> ASC LIMIT 10 OFFSET 1;
>
> Limit  (cost=113.73..113.75 rows=7 width=5) (actual  
> time=0.335..0.340 rows=10 loops=1)
>  ->  Sort  (cost=113.73..113.75 rows=8 width=5) (actual  
> time=0.332..0.333 rows=11 loops=1)
>        Sort Key: processors.speed
>        Sort Method:  quicksort  Memory: 17kB
>        ->  Nested Loop  (cost=47.22..113.61 rows=8 width=5) (actual  
> time=0.171..0.271 rows=13 loops=1)
>              ->  HashAggregate  (cost=47.22..47.30 rows=8 width=4)  
> (actual time=0.148..0.154 rows=13 loops=1)
>                    ->  Bitmap Heap Scan on users_processors   
> (cost=4.36..47.19 rows=12 width=4) (actual time=0.074..0.117 rows=13  
> loops=1)
>                          Recheck Cond: (userid = 4040)
>                          ->  Bitmap Index Scan on  
> users_processors_userid_index  (cost=0.00..4.35 rows=12 width=0)  
> (actual time=0.056..0.056 rows=13 loops=1)
>                                Index Cond: (userid = 4040)
>              ->  Index Scan using processors_pkey on processors   
> (cost=0.00..8.28 rows=1 width=9) (actual time=0.006..0.007 rows=1  
> loops=13)
>                    Index Cond: (processors.id =  
> users_processors.processorid)
> Total runtime: 0.471 ms
> (13 rows)
>
> ___________
>
> EXPLAIN ANALYZE SELECT speed FROM processors WHERE EXISTS (SELECT 1  
> FROM users_processors WHERE userid=4040 AND  
> processorid=processors.id) ORDER BY speed ASC LIMIT 10 OFFSET 1;
>
> Limit  (cost=831413.86..831413.89 rows=10 width=5) (actual  
> time=762.475..762.482 rows=10 loops=1)
>  ->  Sort  (cost=831413.86..831538.86 rows=50000 width=5) (actual  
> time=762.471..762.473 rows=11 loops=1)
>        Sort Key: processors.speed
>        Sort Method:  quicksort  Memory: 17kB
>        ->  Seq Scan on processors  (cost=0.00..830299.00 rows=50000  
> width=5) (actual time=313.591..762.411 rows=13 loops=1)
>              Filter: (subplan)
>              SubPlan
>                ->  Index Scan using users_processors_pkey on  
> users_processors  (cost=0.00..8.29 rows=1 width=0) (actual  
> time=0.006..0.006 rows=0 loops=100000)
>                      Index Cond: ((userid = 4040) AND (processorid =  
> $0))
> Total runtime: 762.579 ms
> (10 rows)
>
>
>
>
> As you can see the second query is much slower. First I thought  
> "Just a difference of 0.3ms?", but then I realized that it was 762ms  
> not 0.762 ;-).
> Both queries return the same result, so I will use #1 and count(*)  
> takes just 0.478ms if I use query #1.
>


This is what I've found with tables ranging in the millions of rows.

Using IN is better when you've got lots of rows to check against the  
IN set and the IN set may be large and possibly complicated to  
retrieve (i.e. lots of joins, or expensive functions).

Postgres will normally build a hash table of the IN set and just  
search that hash table. It's especially fast if the entire hash table  
that is built can fit into RAM.  The cpu/io cost of building the IN  
set can be quite large because it needs to fetch every tuple to hash  
it, but this can be faster then searching tuple by tuple through  
possibly many indexes and tables like EXISTS does.  I like to increase  
work_mem a lot (512mb and up) if I know I'm going to be doing a lot of  
matches against a large IN set of rows because I'd prefer for that  
hash table to never to be written to disk.

EXISTS is better when you're doing fewer matches because it will pull  
the rows out one at a time from its query possibly using indexes, its  
main advantage is that it doesn't pull all of the tuples before it  
starts processing matches.

So in summary both are good to know how to use, but choosing which one  
to use can really depend on your data set and resources.

Cheers,

Rusty
--
Rusty Conover
InfoGears Inc.
http://www.infogears.com

In response to

Responses

pgsql-performance by date

Next:From: Rusty ConoverDate: 2008-06-30 06:59:16
Subject: Re: Out of memory for Select query.
Previous:From: Tom LaneDate: 2008-06-30 04:48:44
Subject: Re: Subquery WHERE IN or WHERE EXISTS faster?

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group