Terrible perfomance during nested "... where x in (select ...)" operator

From: Черепанов Леонид <leo(at)rusmedia(dot)net>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Terrible perfomance during nested "... where x in (select ...)" operator
Date: 2001-05-08 13:20:20
Message-ID: D078291436B0D411B08800805F6D1315073F13@fantom.tcnet.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Leonid (leo(at)rusmedia(dot)net) reports a bug with a severity of 2(?)

Short Description
Terrible perfomance during nested "... where x in (select ...)" operator

Long Description
PostgreSQL 7.1, FreeBSD 4.2-STABLE

Analyzing the reasons for terrible perfomance of my query I've found
a very strange thing. Here is sublimation.

Queries like
select distinct i from t1 where i in
(select distinct i from t2 where j in
(select distinct j from t3));
is much-much slower than
select distinct t1.i from t1,t2,t3 where t1.i=t2.i and t2.j=t3.j;

EXPLAIN results.
The 1st one:
NOTICE: QUERY PLAN:
Unique (cost=0.00..62059410444.36 rows=15000 width=4)
-> Index Scan using t1_i on t1 (cost=0.00..62059410069.36 rows=150000
width=4)
SubPlan
-> Materialize (cost=413729.34..413729.34 rows=400 width=4)
-> Unique (cost=0.00..413729.34 rows=400 width=4)
-> Index Scan using t2_i on t2 (cost=0.00..413719.34
rows=4000 width=4)
SubPlan
-> Materialize (cost=103.38..103.38 rows=225
width=4)
-> Unique (cost=0.00..103.38 rows=225
width=4)
-> Index Scan using t3_j on t3
(cost=0.00..97.76 rows=2250 wid
th=4)
EXPLAIN

And the second:
NOTICE: QUERY PLAN:
Unique (cost=362.62..10472.26 rows=2768 width=16)
-> Merge Join (cost=362.62..10403.06 rows=27681 width=16)
-> Index Scan using t1_i on t1 (cost=0.00..8145.36 rows=150000
width=4)
-> Sort (cost=362.62..362.62 rows=1606 width=12)
-> Hash Join (cost=102.63..277.13 rows=1606 width=12)
-> Seq Scan on t3 (cost=0.00..34.50 rows=2250 width=4)
-> Hash (cost=62.00..62.00 rows=4000 width=8)
-> Seq Scan on t2 (cost=0.00..62.00 rows=4000
width=8)
EXPLAIN

*** IMPORTANT !!! ***
VACUUM and VACUUM ANALIZY make things even worse;
NOTICE: QUERY PLAN:
Unique (cost=0.00..186176903970.36 rows=15000 width=4)
-> Index Scan using t1_i on t1 (cost=0.00..186176903595.36 rows=150000
width=4)
SubPlan
-> Materialize (cost=1241179.30..1241179.30 rows=1200 width=4)
-> Unique (cost=0.00..1241179.30 rows=1200 width=4)
-> Index Scan using t2_i on t2
(cost=0.00..1241149.30 rows=12000 width=4)
SubPlan
-> Materialize (cost=103.38..103.38 rows=225
width=4)
-> Unique (cost=0.00..103.38 rows=225
width=4)
-> Index Scan using t3_j on t3
(cost=0.00..97.76 rows=2250 wid
th=4)
EXPLAIN

NOTICE: QUERY PLAN:
Unique (cost=1027.91..11289.49 rows=7240 width=16)
-> Merge Join (cost=1027.91..11108.49 rows=72401 width=16)
-> Index Scan using t1_i on t1 (cost=0.00..8145.36 rows=150000
width=4)
-> Sort (cost=1027.91..1027.91 rows=4817 width=12)
-> Hash Join (cost=40.12..733.25 rows=4817 width=12)
-> Seq Scan on t2 (cost=0.00..185.00 rows=12000
width=8)
-> Hash (cost=34.50..34.50 rows=2250 width=4)
-> Seq Scan on t3 (cost=0.00..34.50 rows=2250
width=4)
EXPLAIN

Real working times differ 8 TIMES !

Here is a perl script, which generates a PgSQL files with sample tables and
datadump.
The "scale" parameter controls the amount of data, generated by the script.
Recommended "scale" value is 0.1 for typical computer.

--------------------------8<--------------------------
open (F, "> create");
print F (<<CREATE);
drop table t1;
drop table t2;
drop table t3;

create table t1 (i int);
create table t2 (i int, j int);
create table t3 (j int);

create index t1_i on t1 (i);
create index t2_i on t2 (i);
create index t2_j on t2 (j);
create index t3_j on t3 (j);
CREATE

open (F, "> compare");
print F (<<COMPARE);
explain
select count(distinct i) from t1 where i in
(select distinct i from t2 where j in
(select distinct j from t3));
select now();select count(distinct i) from t1 where i in (select distinct i
from t2 where j in (select distinct j from t3));select now();

explain select count(distinct t1.i) from t1,t2,t3 where t1.i=t2.i and
t2.j=t3.j;
select now();select count(distinct t1.i) from t1,t2,t3 where t1.i=t2.i and
t2.j=t3.j;select now();

explain
select count(distinct t1.i) from t1 inner join
(select distinct t2.i from t2 inner join
(select distinct j from t3) as tt3 on tt3.j=t2.j
) as tt2 on tt2.i=t1.i;
select now();select count(distinct t1.i) from t1 inner join (select distinct
t2.i from t2 inner join (select distinct j from t3) as tt3 on tt3.j=t2.j )
as tt2 on tt2.i=t1.i;select now();
COMPARE

print ("Scale: ");
$m=<>;
chomp $m;
open (F, "> data." . $m);
open (F, ">> data." . $m);
print F ("copy t1 from stdin;\n");
for($k=0;$k<300000*$m;$k++)
{
$r = int ( rand(4000*$m) );
print F ("$r\n");
}
print F (<<T1);
\\.

select count(*) as t1_rows from t1;
select count(distinct i) as t1_dist_rows from t1;
T1

print F ("\ncopy t2 from stdin;\n");
for($k=0;$k<8000*$m;$k++)
{
$r = int ( rand(4000*$m) );
print F ("$r\t$k\n");
}
print F (<<T2);
\\.

select count(*) as t2_rows from t2;
select count(distinct i) as t2_dist_rows from t2;
T2

print F ("\ncopy t3 from stdin;\n");
for($k=0;$k<4500*$m;$k++)
{
$r = int ( rand(8000*$m) );
print F ("$r\n");
}
print F (<<T3);
\\.

select count(*) as t3_rows from t3;
select count(distinct j) as t3_dist_rows from t3;
T3
--------------------------8<--------------------------

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Nabil Sayegh 2001-05-08 13:38:27 Re: Alter table add column ignores default
Previous Message Palle Girgensohn 2001-05-08 12:36:19 Re: freebsd sample startup script doesn't work