sub-query optimization

From: Brad Hilton <bhilton(at)vpop(dot)net>
To: pgsql-sql(at)postgresql(dot)org
Subject: sub-query optimization
Date: 2003-02-14 18:22:19
Message-ID: 1045246939.29966.22.camel@aragorn.vpop.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Hello,

I am hoping someone can help explain why modifying the following query
can effect such a huge change in speed. The query is:

select * from articles
where exists
( select 1 from article_categories
where
article_categories.article_id = articles.id and
article_categories.category_id is null
)

The original query was much more complex, but I have trimmed it down to
highlight the problem. The query above also manifests the problem. OK,
the above query (with 100,000 records in the articles table) takes 1292
msec (see output below). If I modify the query slightly:

--------
select 1 from article_categories
-->
select 1 from articles, article_categories
---------

the query takes 98 msec. Now, normally, I would just leave the query at
that and enjoy my newfound speed. But, adding that extra table to the
inner query only helps when the inner query matches few or no records.
In my sample dataset, there are no records where category_id is null.
However, if I modify the inner query such that it matches many records,
adding that extra table to the inner query has the opposite effect. It
kills the speed.

I am wondering if there is a way to write a query that performs
effeciently, regardless of the number of records the inner query
matches. Can anyone offer any help?

Thanks!
-Brad

============================================
TABLE STRUCTURE:
---------------
create table articles (
id serial primary key
);
create index articles_publish_time_key on articles (publish_time);

create table categories (
id serial primary key
);

create table article_categories (
article_id int not null references articles,
category_id int not null references categories,
primary key(article_id, category_id)
);
create index article_categories_article_id_key on article_categories
(article_id);
----------------------------------------

EXPLAIN ANALYZE OUTPUT
--------------------------

(FAST QUERY)
Result (cost=0.00..4217.00 rows=100000 width=187) (actual
time=98.00..98.00 rows=0 loops=1)
One-Time Filter: $0
InitPlan
-> Nested Loop (cost=0.00..2984.03 rows=1 width=8) (actual
time=97.98..97.98 rows=0 loops=1)
-> Seq Scan on article_categories (cost=0.00..2981.00
rows=1 width=4) (actual time=97.98..97.98 rows=0 loops=1)
Filter: (category_id IS NULL)
-> Index Scan using articles_pkey on articles
(cost=0.00..3.01 rows=1 width=4) (never executed)
Index Cond: ("outer".article_id = articles.id)
-> Seq Scan on articles (cost=0.00..4217.00 rows=100000 width=187)
(never executed)
Total runtime: 98.24 msec
(10 rows)

------------------------------------------------
SLOW QUERY:
-------------
Seq Scan on articles (cost=0.00..306827.16 rows=50000 width=187)
(actual time=1292.48..1292.48 rows=0 loops=1)
Filter: (subplan)
SubPlan
-> Index Scan using article_categories_article_id_key on
article_categories (cost=0.00..3.03 rows=1 width=0) (actual
time=0.01..0.01 rows=0 loops=100000)
Index Cond: (article_id = $0)
Filter: (category_id IS NULL)
Total runtime: 1292.68 msec
(7 rows)

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Brad Hilton 2003-02-14 18:38:51 Re: sub-query optimization
Previous Message Mintoo Lall 2003-02-14 18:12:14 Re: Drop temporary table only if it exists