unexpected results with NOT IN query

From: "Mason Hale" <masonhale(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: unexpected results with NOT IN query
Date: 2008-03-20 14:45:25
Message-ID: 8bca3aa10803200745q147e06f6j39ef9c5ab4b21b52@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello --

I'm getting some unexpected results with a NOT IN query -- this is on 8.2.5.

This is the query in question:

prod_2=> select id from feed_download_task where id in (111102466,141701504)
and id not in (select last_feed_download_task_id from subscription);
id
----
(0 rows)

This query returns zero rows, but I expect it to return 1 row, because I
know that 111102466 *is not* in (select last_feed_download_task_id from
subscription) and I know that 1471701504 *is* in that set, as demonstrated
below:

Verify that both id values are in the feed_download_task table:

prod_2=> select id from feed_download_task where id in
(111102466,141701504);
id
-----------
141701504
111102466
(2 rows)

Verify that 111102466 is NOT in the set of last_feed_download_task_id's, and
that 141701504 is in this set:

prod_2=> select last_feed_download_task_id from subscription where
last_feed_download_task_id in (111102466,141701504);
last_feed_download_task_id
----------------------------
141701504
(1 row)

Here's the problem query again, with explain analyze.

prod_2=> select id from feed_download_task where id in (111102466,141701504)
and id not in (select last_feed_download_task_id from subscription);
id
----
(0 rows)

prod_2=> explain analyze select id from feed_download_task where id in
(111102466,141701504) and id not in (select last_feed_download_task_id from
subscription);
QUERY
PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on feed_download_task (cost=45077.24..45083.27 rows=1
width=4) (actual time=601.229..601.229 rows=0 loops=1)
Recheck Cond: (id = ANY ('{111102466,141701504}'::integer[]))
Filter: (NOT (hashed subplan))
-> Bitmap Index Scan on feed_download_task_pkey
(cost=0.00..30.52rows=2 width=0) (actual time=
0.095..0.095 rows=2 loops=1)
Index Cond: (id = ANY ('{111102466,141701504}'::integer[]))
SubPlan
-> Seq Scan on subscription (cost=0.00..44097.78 rows=379578 width=4)
(actual time=0.032..488.193 rows=162365 loops=1)
Total runtime: 601.281 ms
(8 rows)

I've tried re-analyzing and re-indexing the tables involved in this query,
but I still left scratching my head.

I am also aware that I can use a left join instead of a NOT IN query -- but
in this case I need to use this in a DELETE statement, which eliminates the
possibility of the left join (I think).

Here is a version using a left outer join, it returns the expected result:

prod_2=# select feed_download_task.id from feed_download_task left join
subscription on (subscription.last_feed_download_task_id =
feed_download_task.id) where feed_download_task.id in (111102466,141701504)
and subscription.id IS NOT NULL;
id
-----------
141701504
(1 row)

Here is the explain analyze output for the above query:

prod_2=# explain analyze select feed_download_task.id from
feed_download_task left join subscription on (
subscription.last_feed_download_task_id = feed_download_task.id) where
feed_download_task.id in (111102466,141701504) and subscription.id IS NOT
NULL;

QUERY
PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=31.19..51.69 rows=1 width=4) (actual
time=0.158..0.210rows=1 loops=1)
-> Bitmap Heap Scan on feed_download_task (cost=31.19..37.21 rows=2
width=4) (actual time=0.120..0.134 rows=2 loops=1)
Recheck Cond: (id = ANY ('{111102466,141701504}'::integer[]))
-> Bitmap Index Scan on feed_download_task_pkey
(cost=0.00..31.19rows=2 width=0) (actual time=
0.102..0.102 rows=2 loops=1)
Index Cond: (id = ANY ('{111102466,141701504}'::integer[]))
-> Index Scan using index_subscription_on_last_feed_download_task_id on
subscription (cost=0.00..7.23 rows=1 width=4) (actual
time=0.036..0.037rows=0 loops=2)
Index Cond: (subscription.last_feed_download_task_id =
feed_download_task.id)
Filter: (id IS NOT NULL)

I feel like I must be missing something obvious.

Thanks in advance for the assistance.

cheers,
Mason

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2008-03-20 15:15:49 Re: unexpected results with NOT IN query
Previous Message Adrian Klaver 2008-03-20 14:43:39 Re: Row size overhead