mis-estimation on data-warehouse aggregate creation

From: Kris Jurka <books(at)ejurka(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: mis-estimation on data-warehouse aggregate creation
Date: 2004-11-16 09:10:17
Message-ID: Pine.BSO.4.56.0411160339500.31841@leary.csoft.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


I've have a miniature data-warehouse in which I'm trying to rebuild
pre-calcuated aggregate data directly in the database and I'm geting some
poor plans due to a bad mis-estimation of the number of rows involved.

In a standard star schema I have a sales fact table and dimensions
product, customer, and period. From those dimensions I have created
"shrunken" versions of them that only have productline, salesperson and
month data. Now I want to rollup the base fact table to a "shrunken"
version with data summed up for these smaller aggregate dimensions.

The idea is to take a sales table (productid, customerid, periodid,
quantity, usdamount) and create a table with the same columns that have
the "id" columns pointing to the matching smaller dimensions and total up
the quantity and usdamount. Since the shrunken dimension tables have
renumbered ids we look these up by joining on all of the common columns
between the base and shrunken dimensions. The following query does just
that:

CREATE TABLE shf_sales_by_salesperson_productline_month AS
SELECT SUM(sales.quantity) AS quantity,
SUM(sales.usdamount) AS usdamount,
shd_productline.id AS productid,
shd_month.id AS periodid,
shd_salesperson.id AS customerid
FROM sales
JOIN (
SELECT shd_productline.id, product.id AS productid
FROM product, shd_productline
WHERE product.productline = shd_productline.productline
AND product.category = shd_productline.category
AND product.platform = shd_productline.platform
) shd_productline
ON sales.productid = shd_productline.productid
JOIN (
SELECT shd_month.id, period.id AS periodid
FROM period, shd_month
WHERE period.monthnumber = shd_month.monthnumber
AND period.monthname = shd_month.monthname
AND period.year = shd_month.year
AND period.monthyear = shd_month.monthyear
AND period.quarter = shd_month.quarter
AND period.quarteryear = shd_month.quarteryear
) shd_month
ON sales.periodid = shd_month.periodid
JOIN (
SELECT shd_salesperson.id, customer.id AS customerid
FROM customer, shd_salesperson
WHERE customer.salesperson = shd_salesperson.salesperson
) shd_salesperson
ON sales.customerid = shd_salesperson.customerid

GROUP BY shd_productline.id, shd_month.id, shd_salesperson.id

This generates the following EXPLAIN ANALYZE plan for the SELECT portion:

HashAggregate (cost=32869.33..32869.34 rows=1 width=36) (actual time=475182.855..475188.304 rows=911 loops=1)
-> Nested Loop (cost=377.07..32869.32 rows=1 width=36) (actual time=130.179..464299.167 rows=1232140 loops=1)
Join Filter: ("outer".salesperson = "inner".salesperson)
-> Nested Loop (cost=377.07..32868.18 rows=1 width=44) (actual time=130.140..411975.760 rows=1232140 loops=1)
Join Filter: ("outer".customerid = "inner".id)
-> Hash Join (cost=377.07..32864.32 rows=1 width=32) (actual time=130.072..23167.501 rows=1232140 loops=1)
Hash Cond: ("outer".productid = "inner".id)
-> Hash Join (cost=194.23..32679.08 rows=375 width=28) (actual time=83.118..14019.802 rows=1232140 loops=1)
Hash Cond: ("outer".periodid = "inner".id)
-> Seq Scan on sales (cost=0.00..26320.40 rows=1232140 width=24) (actual time=0.109..3335.275 rows=1232140 loops=1)
-> Hash (cost=194.23..194.23 rows=1 width=12) (actual time=81.548..81.548 rows=0 loops=1)
-> Hash Join (cost=4.70..194.23 rows=1 width=12) (actual time=2.544..72.798 rows=3288 loops=1)
Hash Cond: (("outer".monthnumber = "inner".monthnumber) AND ("outer".monthname = "inner".monthname) AND ("outer"."year" = "inner"."year") AND ("outer".monthyear = "inner".monthyear) AND ("outer".quarter = "inner".quarter) AND ("outer".quarteryear = "inner".quarteryear))
-> Seq Scan on period (cost=0.00..90.88 rows=3288 width=54) (actual time=0.009..9.960 rows=3288 loops=1)
-> Hash (cost=3.08..3.08 rows=108 width=58) (actual time=1.643..1.643 rows=0 loops=1)
-> Seq Scan on shd_month (cost=0.00..3.08 rows=108 width=58) (actual time=0.079..0.940 rows=108 loops=1)
-> Hash (cost=182.18..182.18 rows=265 width=12) (actual time=45.431..45.431 rows=0 loops=1)
-> Hash Join (cost=1.23..182.18 rows=265 width=12) (actual time=1.205..40.216 rows=1932 loops=1)
Hash Cond: (("outer".productline = "inner".productline) AND ("outer".category = "inner".category) AND ("outer".platform = "inner".platform))
-> Seq Scan on product (cost=0.00..149.32 rows=1932 width=32) (actual time=0.013..6.179 rows=1932 loops=1)
-> Hash (cost=1.13..1.13 rows=13 width=45) (actual time=0.199..0.199 rows=0 loops=1)
-> Seq Scan on shd_productline (cost=0.00..1.13 rows=13 width=45) (actual time=0.048..0.083 rows=13 loops=1)
-> Seq Scan on customer (cost=0.00..2.83 rows=83 width=20) (actual time=0.005..0.174 rows=83 loops=1232140)
-> Seq Scan on shd_salesperson (cost=0.00..1.06 rows=6 width=24) (actual time=0.004..0.019 rows=6 loops=1232140)
Total runtime: 475197.372 ms
(25 rows)

Note that the estimated number of input rows to the final HashAggreggate
is 1 while the actual number is 1.2 million. By rewriting the JOIN
conditions to LEFT JOIN we force the planner to recognize that there will
be a match for every row in the sales table:

HashAggregate (cost=74601.88..74644.00 rows=8424 width=36) (actual time=39956.115..39961.507 rows=911 loops=1)
-> Hash Left Join (cost=382.43..59200.13 rows=1232140 width=36) (actual time=140.879..30765.373 rows=1232140 loops=1)
Hash Cond: ("outer".customerid = "inner".id)
-> Hash Left Join (cost=377.07..40712.67 rows=1232140 width=32) (actual time=136.069..22721.760 rows=1232140 loops=1)
Hash Cond: ("outer".periodid = "inner".id)
-> Hash Left Join (cost=182.84..34353.99 rows=1232140 width=28) (actual time=50.815..14742.610 rows=1232140 loops=1)
Hash Cond: ("outer".productid = "inner".id)
-> Seq Scan on sales (cost=0.00..26320.40 rows=1232140 width=24) (actual time=0.099..4490.148 rows=1232140 loops=1)
-> Hash (cost=182.18..182.18 rows=265 width=12) (actual time=49.114..49.114 rows=0 loops=1)
-> Hash Join (cost=1.23..182.18 rows=265 width=12) (actual time=1.331..43.662 rows=1932 loops=1)
Hash Cond: (("outer".productline = "inner".productline) AND ("outer".category = "inner".category) AND ("outer".platform = "inner".platform))
-> Seq Scan on product (cost=0.00..149.32 rows=1932 width=32) (actual time=0.128..11.246 rows=1932 loops=1)
-> Hash (cost=1.13..1.13 rows=13 width=45) (actual time=0.200..0.200 rows=0 loops=1)
-> Seq Scan on shd_productline (cost=0.00..1.13 rows=13 width=45) (actual time=0.047..0.081 rows=13 loops=1)
-> Hash (cost=194.23..194.23 rows=1 width=12) (actual time=83.651..83.651 rows=0 loops=1)
-> Hash Join (cost=4.70..194.23 rows=1 width=12) (actual time=2.675..74.693 rows=3288 loops=1)
Hash Cond: (("outer".monthnumber = "inner".monthnumber) AND ("outer".monthname = "inner".monthname) AND ("outer"."year" = "inner"."year") AND ("outer".monthyear = "inner".monthyear) AND ("outer".quarter = "inner".quarter) AND ("outer".quarteryear = "inner".quarteryear))
-> Seq Scan on period (cost=0.00..90.88 rows=3288 width=54) (actual time=0.118..12.126 rows=3288 loops=1)
-> Hash (cost=3.08..3.08 rows=108 width=58) (actual time=1.658..1.658 rows=0 loops=1)
-> Seq Scan on shd_month (cost=0.00..3.08 rows=108 width=58) (actual time=0.081..0.947 rows=108 loops=1)
-> Hash (cost=5.15..5.15 rows=83 width=12) (actual time=3.131..3.131 rows=0 loops=1)
-> Hash Join (cost=1.07..5.15 rows=83 width=12) (actual time=1.937..2.865 rows=83 loops=1)
Hash Cond: ("outer".salesperson = "inner".salesperson)
-> Seq Scan on customer (cost=0.00..2.83 rows=83 width=20) (actual time=0.137..0.437 rows=83 loops=1)
-> Hash (cost=1.06..1.06 rows=6 width=24) (actual time=0.152..0.152 rows=0 loops=1)
-> Seq Scan on shd_salesperson (cost=0.00..1.06 rows=6 width=24) (actual time=0.045..0.064 rows=6 loops=1)
Total runtime: 39974.236 ms
(27 rows)

Given better row estimates the resulting plan runs more than ten times
faster. Why is the planner doing so poorly with estimating the number of
rows returned? I tried:

SET default_statistics_target = 1000;
VACUUM FULL ANALYZE;

but the results were the same. This is on 8.0beta4. Any ideas?

Kris Jurka

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message F. Senault 2004-11-16 10:55:59 Re: mis-estimation on data-warehouse aggregate creation
Previous Message Tom Lane 2004-11-15 23:30:41 Re: Performance difference: SELECT from VIEW or not?