From: | DANIEL CRISTIAN CRUZ <daniel(dot)cruz(at)sc(dot)senai(dot)br> |
---|---|
To: | ??????? ???????? <tivv00(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org |
Subject: | Re: PostgreSQL NOT IN performance |
Date: | 2008-11-19 12:22:06 |
Message-ID: | 91db404146cba5910fcafdba91db5b9b@intranet.sc.senai.br |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
Something weird with your example which doesn't have the same result, see
row count with explain analyze:
cruz=# SELECT version();
version
--------------------------------------------------------------------------------------------
PostgreSQL 8.3.5 on i486-pc-linux-gnu, compiled by GCC gcc-4.3.real
(Debian 4.3.2-1) 4.3.2
(1 registro)
cruz=# EXPLAIN ANALYZE select * from t1 where id not in (select id from
t2);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=1643.00..4928.00 rows=100000 width=4) (actual
time=256.687..585.774 rows=73653 loops=1)
Filter: (NOT (hashed subplan))
SubPlan
-> Seq Scan on t2 (cost=0.00..1393.00 rows=100000 width=4) (actual
time=0.052..86.867 rows=100000 loops=1)
Total runtime: 625.471 ms
(5 registros)
cruz=# EXPLAIN ANALYZE select * from t1 except all (select id from t2);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
SetOp Except All (cost=34469.90..35969.90 rows=30000 width=4) (actual
time=2598.574..3663.712 rows=126733 loops=1)
-> Sort (cost=34469.90..35219.90 rows=300000 width=4) (actual
time=2598.550..3178.387 rows=300000 loops=1)
Sort Key: "*SELECT* 1".id
Sort Method: external merge Disk: 5864kB
-> Append (cost=0.00..7178.00 rows=300000 width=4) (actual
time=0.037..1026.367 rows=300000 loops=1)
-> Subquery Scan "*SELECT* 1" (cost=0.00..4785.00
rows=200000 width=4) (actual time=0.035..439.507 rows=200000 loops=1)
-> Seq Scan on t1 (cost=0.00..2785.00 rows=200000
width=4) (actual time=0.029..161.355 rows=200000 loops=1)
-> Subquery Scan "*SELECT* 2" (cost=0.00..2393.00
rows=100000 width=4) (actual time=0.107..255.160 rows=100000 loops=1)
-> Seq Scan on t2 (cost=0.00..1393.00 rows=100000
width=4) (actual time=0.097..110.639 rows=100000 loops=1)
Total runtime: 3790.831 ms
(10 registros)
</pre>
Sometimes I got a better result (on older versions) with this kind of
query, but in this case it doesn't:
cruz=# EXPLAIN ANALYZE SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id WHERE
t2.id IS NULL;
QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------
Merge Right Join (cost=30092.86..35251.53 rows=155304 width=8) (actual
time=850.232..1671.091 rows=73653 loops=1)
Merge Cond: (t2.id = t1.id)
Filter: (t2.id IS NULL)
-> Sort (cost=9697.82..9947.82 rows=100000 width=4) (actual
time=266.501..372.560 rows=100000 loops=1)
Sort Key: t2.id
Sort Method: quicksort Memory: 4392kB
-> Seq Scan on t2 (cost=0.00..1393.00 rows=100000 width=4)
(actual time=0.029..78.087 rows=100000 loops=1)
-> Sort (cost=20394.64..20894.64 rows=200000 width=4) (actual
time=583.699..855.427 rows=273364 loops=1)
Sort Key: t1.id
Sort Method: quicksort Memory: 8784kB
-> Seq Scan on t1 (cost=0.00..2785.00 rows=200000 width=4)
(actual time=0.087..155.665 rows=200000 loops=1)
Total runtime: 1717.062 ms
(12 registros)
</pre>
Regards,
"??????? ????????" <tivv00(at)gmail(dot)com> escreveu:
>Hello.
>
>It's second query rewrite postgresql seems not to handle - making EXCEPT
>from NOT IT.
>Here is an example:
>Preparation:
>
>drop table if exists t1;
>drop table if exists t2;
>create temporary table t1(id) as
>select
>(random()*100000)::int from generate_series(1,200000) a(id);
>
>create temporary table t2(id) as
>select
>(random()*100000)::int from generate_series(1,100000) a(id);
>analyze t1;
>analyze t2;
>
>Query 1:
>select * from t1 where id not in (select id from t2);
>Plan:
>"Seq Scan on t1 (cost=1934.00..164105319.00 rows=100000 width=4)"
>" Filter: (NOT (subplan))"
>" SubPlan"
>" -> Materialize (cost=1934.00..3325.00 rows=100000 width=4)"
>" -> Seq Scan on t2 (cost=0.00..1443.00 rows=100000 width=4)"
>
>Query 2 (gives same result as Q1):
>select * from t1 except all (select id from t2);
>Plan:
>"SetOp Except All (cost=38721.90..40221.90 rows=30000 width=4)"
>" -> Sort (cost=38721.90..39471.90 rows=300000 width=4)"
>" Sort Key: "*SELECT* 1".id"
>" -> Append (cost=0.00..7328.00 rows=300000 width=4)"
>" -> Subquery Scan "*SELECT* 1" (cost=0.00..4885.00
>rows=200000 width=4)"
>" -> Seq Scan on t1 (cost=0.00..2885.00 rows=200000
>width=4)"
>" -> Subquery Scan "*SELECT* 2" (cost=0.00..2443.00
>rows=100000 width=4)"
>" -> Seq Scan on t2 (cost=0.00..1443.00 rows=100000
>width=4)"
>
>If I am correct, planner simply do not know that he can rewrite NOT IN as
>"EXCEPT ALL" operator, so all NOT INs when list of values to remove is long
>takes very much time.
>So the question is: I am willing to participate in postgresql development
>because it may be easier to fix planner then to rewrite all my queries :).
>How can I? (I mean to work on query planner enhancements by providing new
>options of query rewrite, not to work on other thing nor on enhancing
>planner in other ways, like better estimations of known plans).
>
>
>
>
--
<span style="color: #000080">Daniel Cristian Cruz
</span>Administrador de Banco de Dados
Direção Regional - Núcleo de Tecnologia da Informação
SENAI - SC
Telefone: 48-3239-1422 (ramal 1422)
From | Date | Subject | |
---|---|---|---|
Next Message | Alvaro Herrera | 2008-11-19 13:11:43 | Re: PostgreSQL NOT IN performance |
Previous Message | Віталій Тимчишин | 2008-11-19 11:51:47 | PostgreSQL NOT IN performance |