join to view over custom aggregate seems like it should be faster

From: "Merlin Moncure" <mmoncure(at)gmail(dot)com>
To: "PostgreSQL Performance" <pgsql-performance(at)postgresql(dot)org>
Subject: join to view over custom aggregate seems like it should be faster
Date: 2007-04-09 19:05:17
Message-ID: b42b73150704091205o668a084y1693520c974f3c6b@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I have an odd performance issue on 8.2 that I'd thought I'd document
here. I have a workaround, but I'm if there is something that I'm not
seeing.

ok, for starters:
I have a large table that is basically organized like this:
create table big
(
key1 int,
key2 int,
ts timestamp
[other fields]
);

and a view most_recent_big which lists for each combination of key1
and key2, the '[other fields]' that are behind the highest (most
recent) timestamp. The original view implementation involved a self
join which is the classic sql approach to pulling values from a
denormalized table (the real solution of course is to normalize the
data but I can't do that for various reasons). This wasn't very fast,
so I wrote a custom aggregate to optimize the view (there are usuallly
very small #s of records for key1, key2 pair:

create view latest_big_view as
select key1, key2, max_other_fields[other fields]
from big
group by key1, key2;

This worked very well, but sometimes the index on key1, key2 does not
get utilized when joining against latest_big_view. Let's say I have a
number of key1, key2 pairs in another table:

for example:
select * from foo, latest_big_view using (key1, key2);
breaks down.

here is a example of the 'breakdown' plan on real tables. selecting a
single record from the view is very fast...1ms or less. The join
can't 'see through' the view to filter the index.

dev20400=# explain analyze select * from foo join latest_download
using (host_id, software_binary_id);

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=15.35..4616.65 rows=25 width=90) (actual
time=229.623..10601.317 rows=494 loops=1)
Hash Cond: ((latest_download.host_id = foo.host_id) AND
(latest_download.software_binary_id = foo.software_binary_id))
-> GroupAggregate (cost=0.00..4499.01 rows=4535 width=94) (actual
time=0.346..10370.383 rows=37247 loops=1)
-> Index Scan using software_download_idx on
software_download (cost=0.00..2526.53 rows=45342 width=94) (actual
time=0.028..344.591
SubPlan
-> Result (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual
time=0.005..0.011 rows=1 loops=37247)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual
time=0.005..0.011 rows=1 loops=37247)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual
time=0.005..0.011 rows=1 loops=37247)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual
time=0.005..0.010 rows=1 loops=37247)
-> Hash (cost=7.94..7.94 rows=494 width=8) (actual
time=5.568..5.568 rows=494 loops=1)
-> Seq Scan on foo (cost=0.00..7.94 rows=494 width=8)
(actual time=0.018..2.686 rows=494 loops=1)
Total runtime: 10604.260 ms
(18 rows)

Here is the same query but on the root table, instead of the view:
dev20400=# explain analyze select * from foo join software_download
using (host_id, software_binary_id);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..1521.60 rows=19 width=94) (actual
time=0.084..24.992 rows=607 loops=1)
-> Seq Scan on foo (cost=0.00..7.94 rows=494 width=8) (actual
time=0.044..2.753 rows=494 loops=1)
-> Index Scan using software_download_idx on software_download
(cost=0.00..3.05 rows=1 width=94) (actual time=0.011..0.019 rows=1
loops=49
Index Cond: ((foo.host_id = software_download.host_id) AND
(foo.software_binary_id = software_download.software_binary_id))
Total runtime: 28.385 ms
(5 rows)

I can use a trick with a function to make the view give out reasonalbe results:

create function foo(int, int) returns latest_download as
$$ select * from latest_download where software_binary_id = $1 and
host_id = $2; $$ language sql;

dev20400=# explain analyze select (v).* from (select
foo(software_binary_id, host_id) as v from foo) q;
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Subquery Scan q (cost=0.00..14.12 rows=494 width=32) (actual
time=1.436..139.644 rows=494 loops=1)
-> Seq Scan on foo (cost=0.00..9.18 rows=494 width=8) (actual
time=1.414..131.144 rows=494 loops=1)
Total runtime: 142.887 ms
(3 rows)

Time: 144.306 ms

merlin

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Alex Deucher 2007-04-09 19:14:48 Re: postgres 8.2 seems to prefer Seq Scan
Previous Message Alex Deucher 2007-04-09 18:43:55 Re: postgres 8.2 seems to prefer Seq Scan