Re: Inner join vs where-clause subquery

From: "Jeremy Haile" <jhaile(at)fastmail(dot)fm>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Inner join vs where-clause subquery
Date: 2006-12-19 20:47:56
Message-ID: 1166561276.10609.281240077@webmail.messagingengine.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I'm still confused as to why the inner join version ran so much faster
than the where-clause version.

Here's the inner join query and explain ouput:
select min(nlogid) as start_nlogid,
max(nlogid) as end_nlogid,
min(dtCreateDate) as start_transaction_timestamp,
max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
inner join ( select max(end_nlogid) as previous_nlogid from
activity_log_import_history) as a
on activity_log_facts.nlogid > a.previous_nlogid
where dtCreateDate < '2006-12-18 9:10'

Aggregate (cost=246226.95..246226.96 rows=1 width=12)
-> Nested Loop (cost=49233.27..231209.72 rows=1501722 width=12)
-> Result (cost=0.04..0.05 rows=1 width=0)
InitPlan
-> Limit (cost=0.00..0.04 rows=1 width=4)
-> Index Scan Backward using
activity_log_import_history_end_nlogid_idx on
activity_log_import_history (cost=0.00..114.97
rows=2913 width=4)
Filter: (end_nlogid IS NOT NULL)
-> Bitmap Heap Scan on activity_log_facts
(cost=49233.23..210449.44 rows=1660817 width=12)
Recheck Cond: (activity_log_facts.nlogid >
a.previous_nlogid)
Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp
without time zone)
-> Bitmap Index Scan on activity_log_facts_nlogid_idx
(cost=0.00..49233.23 rows=1660817 width=0)
Index Cond: (activity_log_facts.nlogid >
a.previous_nlogid)

Since the inner join is basically the same thing as doing the
where-clause subquery, why does it generate a far different plan?

On Tue, 19 Dec 2006 20:02:35 +0000, "Richard Huxton" <dev(at)archonet(dot)com>
said:
> Jeremy Haile wrote:
> > Here's the query and explain analyze using the result of the sub-query
> > substituted:
> >
> > QUERY
> > explain analyze select min(nlogid) as start_nlogid,
> > max(nlogid) as end_nlogid,
> > min(dtCreateDate) as start_transaction_timestamp,
> > max(dtCreateDate) as end_transaction_timestamp
> > from activity_log_facts
> > where nlogid > 478287801
> > and dtCreateDate < '2006-12-18 9:10'
> >
> > EXPLAIN ANALYZE
> > Aggregate (cost=657.37..657.38 rows=1 width=12) (actual
> > time=0.018..0.019 rows=1 loops=1)
> > -> Index Scan using activity_log_facts_nlogid_idx on
> > activity_log_facts (cost=0.00..652.64 rows=472 width=12) (actual
> > time=0.014..0.014 rows=0 loops=1)
> > Index Cond: (nlogid > 478287801)
> > Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp without
> > time zone)
> > Total runtime: 0.076 ms
> >
> >
> > Sorry if the reason should be obvious, but I'm not the best at
> > interpreting the explains. Why is this explain so much simpler than the
> > other query plan (with the subquery)?
>
> Because it's planning it with knowledge of what "nlogid"s it's filtering
> by. It knows it isn't going to get many rows back with nlogid >
> 478287801. In your previous explain it thought a large number of rows
> would match and was trying not to sequentially scan the
> activity_log_facts table.
>
> Ideally, the planner would evaluate the subquery in your original form
> (it should know it's only getting one row back from max()). Then it
> could plan the query as above. I'm not sure how tricky that is to do
> though.
>
> --
> Richard Huxton
> Archonet Ltd

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message ALVARO ARCILA 2006-12-20 05:31:21 max_fsm_pages and check_points
Previous Message Jeremy Haile 2006-12-19 20:18:07 Re: Inner join vs where-clause subquery