Should the optimizer see this?

From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Cc: rm_pg(at)cheapcomplexdevices(dot)com
Subject: Should the optimizer see this?
Date: 2005-01-27 01:27:59
Message-ID: Pine.LNX.4.58.0501261636010.9564@greenie.cheapcomplexdevices.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


Short summary... the second query runs faster, and I think
they should be identical queries. Should the optimizer
have found this optimization?

I have two identical (or so I believe) queries; one where I
explicitly add a "is not null" comparison; and one where I
think it would implicitly only find not-null columns.

The queries are

select *
from rt4, rt5
where rt4.tigerfile = rt5.tigerfile
and feat = feat3;

and

select *
from (select * from rt4 where feat3 is not null) as rt4, rt5
where rt4.tigerfile = rt5.tigerfile
and feat = feat3;

I would have thought that the optimizer would see that
if feat3 is null (which it usually is), it doesn't need
to keep those rows and sort them -- but it seems (looking
both at explain analyze and "du" on the tmp directory)
that in the first query it is indeed sorting all the
rows --- even the ones with feat3=null.

The tables are the Census Tiger Line data explained in detail here:
http://www.census.gov/geo/www/tiger/tiger2003/TGR2003.pdf
I can attach the create statemnts for the tables if people
think they'd help. Basically, table rt4 has a column
called feat3 which is usually null, and table rt5 has a
column called feat which is never null. Both tables have
a few million rows.

No indexes were used, since I'm joining everything to
everything, they shouldn't have helped anyway. However
vacuum analyze was run, and (as seen in the second query)
the stats did know that the column feat3 was mostly null.

=====================================================================================================
fli=#
fli=# explain analyze
select *
from rt4, rt5
where rt4.tigerfile = rt5.tigerfile
and feat = feat3;
fli-# fli-# fli-# fli-#

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=1922903.02..1967385.35 rows=117698 width=100) (actual time=179246.872..218920.724 rows=153091 loops=1)
Merge Cond: (("outer".feat3 = "inner".feat) AND ("outer".tigerfile = "inner".tigerfile))
-> Sort (cost=876532.10..888964.80 rows=4973079 width=45) (actual time=57213.327..67313.216 rows=4971022 loops=1)
Sort Key: rt4.feat3, rt4.tigerfile
-> Seq Scan on rt4 (cost=0.00..94198.79 rows=4973079 width=45) (actual time=0.053..10433.883 rows=4971022 loops=1)
-> Sort (cost=1046370.92..1060457.95 rows=5634813 width=55) (actual time=122033.463..134037.127 rows=5767675 loops=1)
Sort Key: rt5.feat, rt5.tigerfile
-> Seq Scan on rt5 (cost=0.00..127146.13 rows=5634813 width=55) (actual time=0.016..22538.958 rows=5635077 loops=1)
Total runtime: 219632.580 ms
(9 rows)

fli=# fli=# fli=#
fli=# explain analyze
select *
from (select * from rt4 where feat3 is not null) as rt4, rt5
where rt4.tigerfile = rt5.tigerfile
and feat = feat3;

fli-# fli-# fli-# fli-# QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=1152466.47..1194789.77 rows=3296 width=100) (actual time=125982.562..145927.220 rows=153091 loops=1)
Merge Cond: (("outer".feat3 = "inner".feat) AND ("outer".tigerfile = "inner".tigerfile))
-> Sort (cost=106095.56..106443.67 rows=139247 width=45) (actual time=11729.319..11823.006 rows=153091 loops=1)
Sort Key: tgr.rt4.feat3, tgr.rt4.tigerfile
-> Seq Scan on rt4 (cost=0.00..94198.79 rows=139247 width=45) (actual time=32.404..10893.373 rows=153091 loops=1)
Filter: (feat3 IS NOT NULL)
-> Sort (cost=1046370.92..1060457.95 rows=5634813 width=55) (actual time=114253.157..126650.225 rows=5767675 loops=1)
Sort Key: rt5.feat, rt5.tigerfile
-> Seq Scan on rt5 (cost=0.00..127146.13 rows=5634813 width=55) (actual time=0.012..19253.431 rows=5635077 loops=1)
Total runtime: 146480.294 ms
(10 rows)

fli=# fli=#
fli=#

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Ken Egervari 2005-01-27 02:17:23 Performance problem with semi-large tables
Previous Message Steve Poe 2005-01-26 23:49:54 Re: Ideal disk setup for Postgresql 7.4?