Re: PostgreSQL NOT IN performance

From: DANIEL CRISTIAN CRUZ <daniel(dot)cruz(at)sc(dot)senai(dot)br>
To: ??????? ???????? <tivv00(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: PostgreSQL NOT IN performance
Date: 2008-11-19 12:22:06
Message-ID: 91db404146cba5910fcafdba91db5b9b@intranet.sc.senai.br
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Something weird with your example which doesn't have the same result, see
row count with explain analyze:

cruz=# SELECT version();
version

--------------------------------------------------------------------------------------------
PostgreSQL 8.3.5 on i486-pc-linux-gnu, compiled by GCC gcc-4.3.real
(Debian 4.3.2-1) 4.3.2
(1 registro)

cruz=# EXPLAIN ANALYZE select * from t1 where id not in (select id from
t2);
QUERY PLAN

------------------------------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=1643.00..4928.00 rows=100000 width=4) (actual
time=256.687..585.774 rows=73653 loops=1)
Filter: (NOT (hashed subplan))
SubPlan
-> Seq Scan on t2 (cost=0.00..1393.00 rows=100000 width=4) (actual
time=0.052..86.867 rows=100000 loops=1)
Total runtime: 625.471 ms
(5 registros)

cruz=# EXPLAIN ANALYZE select * from t1 except all (select id from t2);
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------
SetOp Except All (cost=34469.90..35969.90 rows=30000 width=4) (actual
time=2598.574..3663.712 rows=126733 loops=1)
-> Sort (cost=34469.90..35219.90 rows=300000 width=4) (actual
time=2598.550..3178.387 rows=300000 loops=1)
         Sort Key: "*SELECT* 1".id
         Sort Method:  external merge  Disk: 5864kB
         ->  Append  (cost=0.00..7178.00 rows=300000 width=4) (actual
time=0.037..1026.367 rows=300000 loops=1)
               ->  Subquery Scan "*SELECT* 1"  (cost=0.00..4785.00
rows=200000 width=4) (actual time=0.035..439.507 rows=200000 loops=1)
                     ->  Seq Scan on t1  (cost=0.00..2785.00 rows=200000
width=4) (actual time=0.029..161.355 rows=200000 loops=1)
               ->  Subquery Scan "*SELECT* 2"  (cost=0.00..2393.00
rows=100000 width=4) (actual time=0.107..255.160 rows=100000 loops=1)
                     ->  Seq Scan on t2  (cost=0.00..1393.00 rows=100000
width=4) (actual time=0.097..110.639 rows=100000 loops=1)
 Total runtime: 3790.831 ms
(10 registros)
</pre>
Sometimes I got a better result (on older versions) with this kind of
query, but in this case it doesn't:

cruz=# EXPLAIN ANALYZE SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id WHERE
t2.id IS NULL;
                                                      QUERY
PLAN                                                      
-----------------------------------------------------------------------------------------------------------------------
 Merge Right Join  (cost=30092.86..35251.53 rows=155304 width=8) (actual
time=850.232..1671.091 rows=73653 loops=1)
   Merge Cond: (t2.id = t1.id)
   Filter: (t2.id IS NULL)
   ->  Sort  (cost=9697.82..9947.82 rows=100000 width=4) (actual
time=266.501..372.560 rows=100000 loops=1)
         Sort Key: t2.id
         Sort Method:  quicksort  Memory: 4392kB
         ->  Seq Scan on t2  (cost=0.00..1393.00 rows=100000 width=4)
(actual time=0.029..78.087 rows=100000 loops=1)
   ->  Sort  (cost=20394.64..20894.64 rows=200000 width=4) (actual
time=583.699..855.427 rows=273364 loops=1)
         Sort Key: t1.id
         Sort Method:  quicksort  Memory: 8784kB
         ->  Seq Scan on t1  (cost=0.00..2785.00 rows=200000 width=4)
(actual time=0.087..155.665 rows=200000 loops=1)
 Total runtime: 1717.062 ms
(12 registros)
</pre>
Regards,

"??????? ????????" <tivv00(at)gmail(dot)com> escreveu:

>Hello.
>
>It's second query rewrite postgresql seems not to handle - making EXCEPT
>from NOT IT.
>Here is an example:
>Preparation:
>
>drop table if exists t1;
>drop table if exists t2;
>create temporary table t1(id) as
>select
>(random()*100000)::int from generate_series(1,200000) a(id);
>
>create temporary table t2(id) as
>select
>(random()*100000)::int from generate_series(1,100000) a(id);
>analyze t1;
>analyze t2;
>
>Query 1:
>select * from t1 where id not in (select id from t2);
>Plan:
>"Seq Scan on t1  (cost=1934.00..164105319.00 rows=100000 width=4)"
>"  Filter: (NOT (subplan))"
>"  SubPlan"
>"    ->  Materialize  (cost=1934.00..3325.00 rows=100000 width=4)"
>"          ->  Seq Scan on t2  (cost=0.00..1443.00 rows=100000 width=4)"
>
>Query 2 (gives same result as Q1):
>select * from t1 except all (select id from t2);
>Plan:
>"SetOp Except All  (cost=38721.90..40221.90 rows=30000 width=4)"
>"  ->  Sort  (cost=38721.90..39471.90 rows=300000 width=4)"
>"        Sort Key: "*SELECT* 1".id"
>"        ->  Append  (cost=0.00..7328.00 rows=300000 width=4)"
>"              ->  Subquery Scan "*SELECT* 1"  (cost=0.00..4885.00
>rows=200000 width=4)"
>"                    ->  Seq Scan on t1  (cost=0.00..2885.00 rows=200000
>width=4)"
>"              ->  Subquery Scan "*SELECT* 2"  (cost=0.00..2443.00
>rows=100000 width=4)"
>"                    ->  Seq Scan on t2  (cost=0.00..1443.00 rows=100000
>width=4)"
>
>If I am correct, planner simply do not know that he can rewrite NOT IN as
>"EXCEPT ALL" operator, so all NOT INs when list of values to remove is long
>takes very much time.
>So the question is: I am willing to participate in postgresql development
>because it may be easier to fix planner then to rewrite all my queries :).
>How can I? (I mean to work on query planner enhancements by providing new
>options of query rewrite, not to work on other thing nor on enhancing
>planner in other ways, like better estimations of known plans).
>
>
>
>

 

--
<span style="color: #000080">Daniel Cristian Cruz
</span>Administrador de Banco de Dados
Direção Regional - Núcleo de Tecnologia da Informação
SENAI - SC
Telefone: 48-3239-1422 (ramal 1422)

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Alvaro Herrera 2008-11-19 13:11:43 Re: PostgreSQL NOT IN performance
Previous Message Віталій Тимчишин 2008-11-19 11:51:47 PostgreSQL NOT IN performance