Re: WHERE IN (subselect) versus WHERE IN (1,2,3,)

From: Kevin Goess <kgoess(at)bepress(dot)com>
To: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: WHERE IN (subselect) versus WHERE IN (1,2,3,)
Date: 2012-03-19 11:22:22
Message-ID: CABZkbxgbE4cnzut2Kr9zCcXt=OowR7CD950By3i-jTPLa_ykTg@mail.gmail.com
Lists: pgsql-general

On Mon, Mar 19, 2012 at 9:24 AM, Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at> wrote:

> That means that your statistics are not accurate.
>
> As a first measure, you should ANALYZE the tables involved and see if
> the problem persists. If yes, post the new plans.
>

Aha, thanks, that explains why my test table with one row behaved so badly.
But even with all tables freshly ANALYZE'd, I still see the query revert to
a sequential scan on that big contexts table once the number of rows in the
subselect goes over 199. Here's a simplified version that demonstrates the
problem.

production=> explain (analyze, buffers)
    SELECT contexts.context_key
    FROM contexts
    JOIN articles ON (articles.context_key = contexts.context_key)
    WHERE contexts.context_key IN (SELECT context_key FROM virtual_ancestors LIMIT 200)
      AND articles.indexed;
                                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=7086.13..219322.15 rows=411736 width=4) (actual time=50.118..1213.046 rows=35 loops=1)
   Hash Cond: (contexts.context_key = articles.context_key)
   Buffers: shared hit=72539 read=100104
   ->  Seq Scan on contexts  (cost=0.00..190285.83 rows=1783283 width=4) (actual time=0.040..769.891 rows=1786074 loops=1)
         Buffers: shared hit=72399 read=100054
   ->  Hash  (cost=1939.43..1939.43 rows=411736 width=8) (actual time=3.510..3.510 rows=35 loops=1)
         Buckets: 65536  Batches: 1  Memory Usage: 2kB
         Buffers: shared hit=140 read=50
         ->  Nested Loop  (cost=6.18..1939.43 rows=411736 width=8) (actual time=0.203..3.487 rows=35 loops=1)
               Buffers: shared hit=140 read=50
               ->  HashAggregate  (cost=6.18..8.18 rows=200 width=4) (actual time=0.174..0.198 rows=48 loops=1)
                     Buffers: shared read=2
                     ->  Limit  (cost=0.00..3.68 rows=200 width=4) (actual time=0.015..0.108 rows=200 loops=1)
                           Buffers: shared read=2
                           ->  Seq Scan on virtual_ancestors  (cost=0.00..87676.17 rows=4759617 width=4) (actual time=0.015..0.075 rows=200 loops=1)
                                 Buffers: shared read=2
               ->  Index Scan using articles_pkey on articles  (cost=0.00..9.64 rows=1 width=4) (actual time=0.015..0.068 rows=1 loops=48)
                     Index Cond: (articles.context_key = virtual_ancestors.context_key)
                     Filter: articles.indexed
                     Buffers: shared hit=140 read=48
 Total runtime: 1213.138 ms
(21 rows)
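The misestimate is visible right at the top of that plan: the Hash Join is estimated at rows=411736 but actually produces rows=35, and the planner picks the seq scan on contexts accordingly. One diagnostic I know of (a sketch, not a fix; `enable_seqscan` is a standard planner GUC, but whether it flips this particular plan is an assumption on my part) is to disable seq scans for a single transaction and re-run the EXPLAIN, to see what the planner thinks the alternative plan would cost:

```sql
-- Diagnostic only: discourage seq scans for this transaction, then re-run
-- the EXPLAIN to compare the plan the planner would otherwise have chosen.
BEGIN;
SET LOCAL enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT contexts.context_key
FROM contexts
JOIN articles ON (articles.context_key = contexts.context_key)
WHERE contexts.context_key IN (SELECT context_key FROM virtual_ancestors LIMIT 200)
  AND articles.indexed;
ROLLBACK;  -- SET LOCAL is discarded with the transaction
```

If the index plan comes back with a higher estimated cost but a much lower actual time, that would point at the row estimate rather than the cost parameters.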

But if I write the keys inline instead of using the subquery, I get a very
nice execution plan, all the way up to a tested maximum of about 50,000 keys:

production=> explain (analyze, buffers)
    SELECT contexts.context_key
    FROM contexts
    JOIN articles ON (articles.context_key = contexts.context_key)
    WHERE contexts.context_key IN (2482612,2482612,...)
      AND articles.indexed;
                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=758.71..3418.40 rows=200 width=4) (actual time=0.621..1.089 rows=35 loops=1)
   Buffers: shared hit=826 read=1
   ->  Bitmap Heap Scan on contexts  (cost=752.58..1487.55 rows=200 width=4) (actual time=0.604..0.699 rows=48 loops=1)
         Recheck Cond: (context_key = ANY ('{2482612,2482612,...}'::integer[]))
         Buffers: shared hit=639
         ->  Bitmap Index Scan on contexts_pkey  (cost=0.00..752.53 rows=200 width=0) (actual time=0.591..0.591 rows=200 loops=1)
               Index Cond: (context_key = ANY ('{2482612,2482612,...}'::integer[]))
               Buffers: shared hit=600
   ->  Bitmap Heap Scan on articles  (cost=6.13..9.64 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=48)
         Recheck Cond: (articles.context_key = contexts.context_key)
         Filter: articles.indexed
         Buffers: shared hit=187 read=1
         ->  Bitmap Index Scan on articles_pkey  (cost=0.00..6.13 rows=1 width=0) (actual time=0.005..0.005 rows=1 loops=48)
               Index Cond: (articles.context_key = contexts.context_key)
               Buffers: shared hit=148
 Total runtime: 1.147 ms

Is it expected behavior that writing the ids inline performs so much better
than the subquery? I've been told it isn't, but this isn't the first time
I've seen it, so I suspect I'm missing something.
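In the meantime, one workaround sketch I'm considering (an assumption on my part that it helps here, not something I've benchmarked yet; `wanted_keys` is a hypothetical name): materialize the subselect's keys into a temp table and ANALYZE it, so the planner joins against real statistics instead of a default estimate:

```sql
-- Sketch: give the planner real row counts for the key set by
-- materializing it first (assumes the key list fits in a temp table).
CREATE TEMP TABLE wanted_keys AS
    SELECT context_key FROM virtual_ancestors LIMIT 200;
ANALYZE wanted_keys;

SELECT contexts.context_key
FROM contexts
JOIN articles    ON (articles.context_key = contexts.context_key)
JOIN wanted_keys ON (wanted_keys.context_key = contexts.context_key)
WHERE articles.indexed;
```

That should put the planner in roughly the same position as the inline list, since it can see exactly how many keys there are.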
