Re: Reason of Slowness of query

From: Chetan Suttraway <chetan(dot)suttraway(at)enterprisedb(dot)com>
To: Adarsh Sharma <adarsh(dot)sharma(at)orkash(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org, Samuel Gendler <sgendler(at)ideasculptor(dot)com>
Subject: Re: Reason of Slowness of query
Date: 2011-03-23 07:25:35
Message-ID: AANLkTi=G8+LEO8D_5JY96scigx8OfcizqB5PbNcK=Ov5@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Wed, Mar 23, 2011 at 12:50 PM, Samuel Gendler
<sgendler(at)ideasculptor(dot)com>wrote:

> On Tue, Mar 22, 2011 at 11:28 PM, Adarsh Sharma <adarsh(dot)sharma(at)orkash(dot)com>wrote:
>
>> *
>> *I perform a join query on it as :
>>
>> * explain analyze select distinct(p.crawled_page_id) from page_content p
>> , clause2 c where p.crawled_page_id != c.source_id ;
>>
>> *What it takes more than 1 hour to complete. As I issue the explain
>> analyze command and cannot able to wait for output but I send my explain
>> output as :
>>
>
>
> please describe what your query is trying to select, as it is possible that
> query isn't doing what you think it is. joining 2 tables where id1 != id2
> will create a cross multiple of the two tables such that every row from the
> first table is matched with every single row from the second table that
> doesn't have a matching id. Then you are looking for distinct values on
> that potentially enormous set of rows.
>
> db_v2=# select * from table1;
> id | value
> ----+-------
> 1 | 1
> 2 | 2
> 3 | 3
> (3 rows)
>
> db_v2=# select * from table2;
> id | value
> ----+-------
> 1 | 4
> 2 | 5
> 3 | 6
> (3 rows)
>
> db_v2=# select t1.id, t1.value, t2.id, t2.value from table1 t1, table2 t2
> where t1.id != t2.id;
> id | value | id | value
> ----+-------+----+-------
> 1 | 1 | 2 | 5
> 1 | 1 | 3 | 6
> 2 | 2 | 1 | 4
> 2 | 2 | 3 | 6
> 3 | 3 | 1 | 4
> 3 | 3 | 2 | 5
>
> So if you have a couple of million rows in each table, you are selecting
> distinct over a potentially huge set of data. If you are actually trying
> to find all ids from one table which have no match at all in the other
> table, then you need an entirely different query:
>
> db_v2=# insert into table2 (value) values (7);
> INSERT 0 1
>
> db_v2=# select * from table2;
> id | value
> ----+-------
> 1 | 4
> 2 | 5
> 3 | 6
> 4 | 7
>
> db_v2=# select t2.id, t2.value from table2 t2 where not exists (select 1
> from table1 t1 where t1.id = t2.id);
> id | value
> ----+-------
> 4 | 7
>
>

Check this setup:
pg=# create table t1(a int, b int);
CREATE TABLE
pg=# create index t1_b on t1(b);
CREATE INDEX
pg=# create table t2(c int, d int);
CREATE TABLE
pg=# create index t2_cd on t2(c,d);
CREATE INDEX
pg=# explain select distinct(b) from t1,t2 where t1.b !=t2.d;
QUERY
PLAN
-------------------------------------------------------------------------------
Unique (cost=0.00..80198.86 rows=200 width=4)
-> Nested Loop (cost=0.00..68807.10 rows=4556702 width=4)
Join Filter: (t1.b <> t2.d)
-> Index Scan using t1_b on t1 (cost=0.00..76.35 rows=2140
width=4)
-> Materialize (cost=0.00..42.10 rows=2140 width=4)
-> Seq Scan on t2 (cost=0.00..31.40 rows=2140 width=4)
(6 rows)

pg=# explain select distinct(b) from t1 where NOT EXISTS (select 1 from t2
where t2.d=t1.b);
QUERY PLAN
------------------------------------------------------------------------
HashAggregate (cost=193.88..193.89 rows=1 width=4)
-> Hash Anti Join (cost=58.15..193.88 rows=1 width=4)
Hash Cond: (t1.b = t2.d)
-> Seq Scan on t1 (cost=0.00..31.40 rows=2140 width=4)
-> Hash (cost=31.40..31.40 rows=2140 width=4)
-> Seq Scan on t2 (cost=0.00..31.40 rows=2140 width=4)
(6 rows)

The cost seems to be on higher side, but maybe on your system with index
scan on t2 and t1, the cost might be on lower side.

Another query which forced index scan was :
pg=# explain select distinct(b) from t1,t2 where t1.b >t2.d union all
select distinct(b) from t1,t2 where t1.b <t2.d;
QUERY
PLAN
-------------------------------------------------------------------------------------
Append (cost=0.00..100496.74 rows=400 width=4)
-> Unique (cost=0.00..50246.37 rows=200 width=4)
-> Nested Loop (cost=0.00..46430.04 rows=1526533 width=4)
-> Index Scan using t1_b on t1 (cost=0.00..76.35 rows=2140
width=4)
-> Index Scan using t2_d on t2 (cost=0.00..12.75 rows=713
width=4)
Index Cond: (public.t1.b > public.t2.d)
-> Unique (cost=0.00..50246.37 rows=200 width=4)
-> Nested Loop (cost=0.00..46430.04 rows=1526533 width=4)
-> Index Scan using t1_b on t1 (cost=0.00..76.35 rows=2140
width=4)
-> Index Scan using t2_d on t2 (cost=0.00..12.75 rows=713
width=4)
Index Cond: (public.t1.b < public.t2.d)
(11 rows)

This looks like to a acceptable.
Please try this above query with your setup and post the explain output.

--
Regards,
Chetan Suttraway
EnterpriseDB <http://www.enterprisedb.com/>, The Enterprise
PostgreSQL<http://www.enterprisedb.com/>
company.

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Adarsh Sharma 2011-03-23 07:30:26 Re: Reason of Slowness of query
Previous Message Vitalii Tymchyshyn 2011-03-23 07:25:34 Re: Reason of Slowness of query