issue with query optimizer when joining two partitioned tables

From: Anish Kejariwal <anishkej(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: issue with query optimizer when joining two partitioned tables
Date: 2011-07-08 21:36:48
Message-ID: CAOpcnr_nx2OAoyjJknUdeJsvFame=_K0bfKgzTjVq4G+=7yxZA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I have run into issue where the query optimizer is choosing the wrong
execution plan when I'm trying to join two large tables that have been
partitioned. I would really appreciate it if someone could help me out
this. I don't know whether I've found a bug in the optimizer, or whether
there is some parameter/option I need to set in postgres. Below, I've
included my execution plans. I'm using postgres 9.0.3, and I'm running this
on a pretty beefy Linux server.

My two tables:
-widget: has 4041866 records, and is broken up into 4 partitions (no records
are in the parent table).
-icecream: I'm starting with zero records, but since this there could be
billions of ice-cream records, I will partition and will not have any
records in the parent table.

So, then I then create my first partition in icecream table, and load
4041866 records into it.

Here is the query I'm using to join the two tables:
explain analyze
SELECT
r.widget_id,
r.widget_type_id,
avg(rc.cost)::double precision cost_avg
FROM
widget r,
icecream rc
WHERE
r.widget_type_id = 4
and r.widgetset_id = 5
AND r.widget_id = rc.widget_id
and rc.dataset_id = 281
group by r.widget_id,r.chromosome, r.start_pos, r.end_pos,r.widget_type_id
;

Here is the corresponding execution plan:

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=147262.20..147299.12 rows=1136 width=41) (actual
time=31876.290..31904.880 rows=11028 loops=1)
-> Merge Join (cost=95574.83..112841.79 rows=1147347 width=41) (actual
time=31130.870..31832.922 rows=11028 loops=1)
Merge Cond: (r.widget_id = rc.widget_id)
-> Sort (cost=1913.89..1942.27 rows=11352 width=21) (actual
time=56.818..68.701 rows=11028 loops=1)
Sort Key: r.widget_id
Sort Method: quicksort Memory: 1246kB
-> Append (cost=4.28..1149.30 rows=11352 width=21) (actual
time=0.139..40.513 rows=11028 loops=1)
-> Bitmap Heap Scan on widget r (cost=4.28..12.75
rows=1 width=48) (actual time=0.030..0.030 rows=0 loops=1)
Recheck Cond: (widgetset_id = 5)
Filter: (widget_type_id = 4)
-> Bitmap Index Scan on widget_widgetset_id_idx
(cost=0.00..4.28 rows=4 width=0) (actual time=0.023..0.023 rows=0 loops=1)
Index Cond: (widgetset_id = 5)
-> Index Scan using
widget_part_5_widget_widget_type_id_idx on widget_part_5 r
(cost=0.00..1136.55 rows=11351 width=21) (actual time=0.106..18.489
rows=11028 loops=1)
Index Cond: (widget_type_id = 4)
Filter: (widgetset_id = 5)
-> Sort (cost=93660.94..93711.47 rows=20214 width=24) (actual
time=29730.522..30766.354 rows=946140 loops=1)
Sort Key: rc.widget_id
Sort Method: external sort Disk: 165952kB
-> Append (cost=0.00..92215.33 rows=20214 width=24) (actual
time=0.057..13731.204 rows=4041866 loops=1)
-> Seq Scan on icecream rc (cost=0.00..23.00 rows=5
width=24) (actual time=0.002..0.002 rows=0 loops=1)
Filter: (dataset_id = 281)
-> Seq Scan on icecream_part_281 rc
(cost=0.00..92192.33 rows=20209 width=24) (actual time=0.051..5427.730
rows=4041866 loops=1)
Filter: (dataset_id = 281)
Total runtime: 33182.945 ms
(24 rows)

The query is doing a merge join, is taking 33 seconds, but should take less
than a second. So, then I do: select * from icecream;

Now, when I run the same query again, I get a different and correct
execution plan (nested loop), and the query takes less than 1 second as I
would expect.

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=7223611.41..7223648.33 rows=1136 width=41) (actual
time=392.822..420.166 rows=11028 loops=1)
-> Nested Loop (cost=4.28..341195.22 rows=229413873 width=41) (actual
time=0.231..331.800 rows=11028 loops=1)
Join Filter: (r.widget_id = rc.widget_id)
-> Append (cost=4.28..1149.30 rows=11352 width=21) (actual
time=0.051..50.181 rows=11028 loops=1)
-> Bitmap Heap Scan on widget r (cost=4.28..12.75 rows=1
width=48) (actual time=0.013..0.013 rows=0 loops=1)
Recheck Cond: (widgetset_id = 5)
Filter: (widget_type_id = 4)
-> Bitmap Index Scan on widget_widgetset_id_idx
(cost=0.00..4.28 rows=4 width=0) (actual time=0.007..0.007 rows=0 loops=1)
Index Cond: (widgetset_id = 5)
-> Index Scan using widget_part_5_widget_widget_type_id_idx
on widget_part_5 r (cost=0.00..1136.55 rows=11351 width=21) (actual
time=0.033..21.254 rows=11028 loops=1)
Index Cond: (widget_type_id = 4)
Filter: (widgetset_id = 5)
-> Append (cost=0.00..29.88 rows=6 width=24) (actual
time=0.014..0.018 rows=1 loops=11028)
-> Seq Scan on icecream rc (cost=0.00..23.00 rows=5
width=24) (actual time=0.001..0.001 rows=0 loops=11028)
Filter: (rc.dataset_id = 281)
-> Index Scan using icecream_part_281_widget_id_idx on
icecream_part_281 rc (cost=0.00..6.88 rows=1 width=24) (actual
time=0.009..0.010 rows=1 loops=11028)
Index Cond: (rc.widget_id = r.widget_id)
Filter: (rc.dataset_id = 281)
Total runtime: 431.935 ms
(19 rows)

My guess as to what happened:
-because the icecream parent table has zero records, the query optimizer
chooses the incorrect execution plan
-when I do select * from icecream, the optimizer now knows how many records
are really in the icecream table, by knowing that the icecream table has
partitions.

Next, if I run vacuum analyze on the parent table, I again get a wrong/slow
execution plan (this time it uses the hash join). Again, I think this is
because the parent table itself has zero records.

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=191926.03..191962.95 rows=1136 width=41) (actual
time=28967.567..28994.395 rows=11028 loops=1)
-> Hash Join (cost=166424.79..191585.47 rows=11352 width=41) (actual
time=28539.196..28917.830 rows=11028 loops=1)
Hash Cond: (r.widget_id = rc.widget_id)
-> Append (cost=4.28..1149.30 rows=11352 width=21) (actual
time=0.054..54.068 rows=11028 loops=1)
-> Bitmap Heap Scan on widget r (cost=4.28..12.75 rows=1
width=48) (actual time=0.013..0.013 rows=0 loops=1)
Recheck Cond: (widgetset_id = 5)
Filter: (widget_type_id = 4)
-> Bitmap Index Scan on widget_widgetset_id_idx
(cost=0.00..4.28 rows=4 width=0) (actual time=0.007..0.007 rows=0 loops=1)
Index Cond: (widgetset_id = 5)
-> Index Scan using widget_part_5_widget_widget_type_id_idx
on widget_part_5 r (cost=0.00..1136.55 rows=11351 width=21) (actual
time=0.035..22.419 rows=11028 loops=1)
Index Cond: (widget_type_id = 4)
Filter: (widgetset_id = 5)
-> Hash (cost=92214.73..92214.73 rows=4041823 width=24) (actual
time=28438.419..28438.419 rows=4041866 loops=1)
Buckets: 524288 Batches: 2 Memory Usage: 118449kB
-> Append (cost=0.00..92214.73 rows=4041823 width=24)
(actual time=0.020..14896.908 rows=4041866 loops=1)
-> Seq Scan on icecream rc (cost=0.00..23.00 rows=5
width=24) (actual time=0.002..0.002 rows=0 loops=1)
Filter: (dataset_id = 281)
-> Seq Scan on icecream_part_281 rc
(cost=0.00..92191.73 rows=4041818 width=24) (actual time=0.012..5718.592
rows=4041866 loops=1)
Filter: (dataset_id = 281)
Total runtime: 29007.937 ms
(20 rows)

select * from icecream does not fix this issue.

I could of course disable hash join and merge join to force postgres to use
a nested loop, but my system is often joining these two tables, and I'd
rather not have to set this in every single place.
set enable_mergejoin=off;
set enable_hashjoin=off;
set enable_nestloop = on;

thanks in advance!!!

Anish

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Jonathan 2011-07-08 22:23:41 Re: Slow query when using ORDER BY *and* LIMIT
Previous Message Shaun Thomas 2011-07-08 20:20:31 Just a note about column equivalence disarming the planner