Query planner plus partitions equals very bad plans, again

From: Scott Carey <scott(at)richrelevance(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Query planner plus partitions equals very bad plans, again
Date: 2008-12-21 20:08:36
Message-ID: C573E244.1577%scott@richrelevance.com
Lists: pgsql-performance

Here is a query on a partitioned schema that produces a very bad query plan. The tables are fully vacuumed, analyzed with stats target 40, and no bloat (created with pure inserts, no updates or deletes).

I already know of at least three bugs with the query planner and partitions, listed at the end of this message.

Below are explain analyze examples on just one partition. Expanding the date range to be multiple partitions only makes this worse.
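For reference, a minimal sketch of the sort of layout in play here: inheritance partitioning (8.3 era) with CHECK constraints so constraint_exclusion can prune children. Table and column names are taken from the plans below; the exact column types and any additional columns are guesses.

```sql
-- Parent table; partitioned by (s_id, date) via inheritance.
CREATE TABLE s_logs (
    s_id    integer,
    date    date,
    session varchar,
    terms   varchar
);

-- One child per (s_id, date) combination; the CHECK constraint lets
-- constraint_exclusion prune partitions that cannot match the WHERE clause.
CREATE TABLE s_logs_026_2008_12_17 (
    CHECK (s_id = 26 AND date = '2008-12-17')
) INHERITS (s_logs);

-- pp_logs / pp_logs_026_2008_12_17 follow the same pattern.
```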

First, the query written to go against the direct partitions, which is fast and does the right plan:
(although, wouldn't it use less memory if it hashed the small table and scanned the larger one against that, rather than vice versa? Or am I reading the explain wrong?)

rr=> explain analyze select pl.id, sl.terms from p_pp_logs_026_2008_12_17 pl, p_s_logs_026_2008_12_17 sl where sl.session = pl.session and sl.terms <> 'null' and sl.terms is not null;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=31470.89..37199.26 rows=38597 width=27) (actual time=743.782..772.705 rows=29907 loops=1)
Hash Cond: ((pl.session)::text = (sl.session)::text)
-> Seq Scan on pp_logs_026_2008_12_17 pl (cost=0.00..357.49 rows=8449 width=46) (actual time=0.007..3.064 rows=8449 loops=1)
-> Hash (cost=24822.35..24822.35 rows=531883 width=57) (actual time=743.597..743.597 rows=531453 loops=1)
-> Seq Scan on s_logs_026_2008_12_17 sl (cost=0.00..24822.35 rows=531883 width=57) (actual time=0.011..392.242 rows=532067 loops=1)
Filter: ((terms IS NOT NULL) AND ((terms)::text <> 'null'::text))
Total runtime: 779.431 ms

In the above, the estimated and actual rows are about right, only a bit off after the join.

We are partitioned by s_id and date; the general form of the query, used for scanning more than one combination of these values, fails.
The query below restricts the data to the same tables as above (plus the parent tables, which each hold one irrelevant row of dummy data so that analyze works on them), yet the planner is extremely wrong:

explain analyze select pl.id, sl.terms from pp_logs pl, s_logs sl where pl.s_id = 26 and sl.s_id = 26 and sl.session = pl.session and sl.terms <> 'null' and sl.terms is not null and pl.date = '2008-12-17' and sl.date = '2008-12-17';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=79036.43..416160.16 rows=22472099 width=27) (actual time=7089.961..7839.665 rows=29907 loops=1)
Merge Cond: ((pl.session)::text = (sl.session)::text)
-> Sort (cost=968.98..990.10 rows=8450 width=46) (actual time=193.778..196.690 rows=8449 loops=1)
Sort Key: pl.session
Sort Method: quicksort Memory: 1043kB
-> Append (cost=0.00..417.84 rows=8450 width=46) (actual time=36.824..133.157 rows=8449 loops=1)
-> Seq Scan on pp_logs pl (cost=0.00..18.10 rows=1 width=21) (actual time=0.056..0.056 rows=0 loops=1)
Filter: ((s_id = 26) AND (date = '2008-12-17'::date))
-> Seq Scan on pp_logs_026_2008_12_17 pl (cost=0.00..399.74 rows=8449 width=46) (actual time=36.766..129.317 rows=8449 loops=1)
Filter: ((s_id = 26) AND (date = '2008-12-17'::date))
-> Sort (cost=78067.45..79397.16 rows=531884 width=57) (actual time=6895.648..7094.701 rows=552412 loops=1)
Sort Key: sl.session
Sort Method: quicksort Memory: 92276kB
-> Append (cost=0.00..27483.28 rows=531884 width=57) (actual time=33.759..1146.936 rows=532067 loops=1)
-> Seq Scan on s_logs sl (cost=0.00..1.19 rows=1 width=22) (actual time=0.036..0.036 rows=0 loops=1)
Filter: ((terms IS NOT NULL) AND ((terms)::text <> 'null'::text) AND (s_id = 26) AND (date = '2008-12-17'::date))
-> Seq Scan on s_logs_026_2008_12_17 sl (cost=0.00..27482.09 rows=531883 width=57) (actual time=33.721..903.607 rows=532067 loops=1)
Filter: ((terms IS NOT NULL) AND ((terms)::text <> 'null'::text) AND (s_id = 26) AND (date = '2008-12-17'::date))
Total runtime: 7861.723 ms

Some partitions have about 10x the data of the above, resulting in ~30x longer times to do the sort, but only 10x longer to hash. The sort also uses far, far more memory than a hash would if it hashed the small table and hash-joined the large one against that.

As you can see in the above example, this inner join, which the planner knows has two arms of row count ~530K and ~8.5K, is predicted to have 22M output rows. What?
Did it lose its ability to even roughly know n_distinct? Can it use n_distinct for each table and combine them for a (weighted) worst-case estimate and arrive at something less bizarre?
If I scan across more tables, the actual output grows roughly linearly, but the estimate looks like it's A*ln(A*B), where A is the size of the larger arm and B is the size of the smaller one. Having an inner join produce more rows than one of its arms is implausible with n_distinct numbers like these, as evidenced by the estimate for the direct-to-table query.
Perhaps the parent table (with one row) is getting in the way?
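For what it's worth, the per-partition n_distinct the planner actually has on hand can be inspected directly; pg_stats is the standard statistics view, and the table names below are the partitions from the plans above:

```sql
-- Check what ANALYZE recorded for the join column on each arm.
-- A negative n_distinct means "this fraction of the row count is distinct".
SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE tablename IN ('pp_logs_026_2008_12_17', 's_logs_026_2008_12_17',
                    'pp_logs', 's_logs')
  AND attname = 'session';
```

If the parent tables show wildly different n_distinct than the children, that would support the theory that the one-row parents are poisoning the combined estimate.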

Any suggestions on how to get a good plan when accessing this via the parent table to scan multiple dates?
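The bluntest workaround I can think of is to disable merge joins for this one query so the planner falls back to the hash join it already picks for direct partition access (enable_mergejoin is a standard planner GUC; whether this holds up across all our date ranges is untested):

```sql
-- Discourage the pathological merge join for this session only.
SET enable_mergejoin = off;

SELECT pl.id, sl.terms
FROM pp_logs pl, s_logs sl
WHERE pl.s_id = 26 AND sl.s_id = 26
  AND sl.session = pl.session
  AND sl.terms <> 'null' AND sl.terms IS NOT NULL
  AND pl.date = '2008-12-17' AND sl.date = '2008-12-17';

-- Restore the default so other queries are unaffected.
RESET enable_mergejoin;
```

That treats the symptom, not the row-count estimate itself, so the 22M-row guess would presumably still distort plans higher up in a larger query.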

Known (to me) partition bugs related to the query planner / stats:
* Row width estimates for scans across multiple tables are wrong (just max(each table width) rather than a weighted average - this could have been fixed in 8.3.4+ without me noticing though).
* Column selectivity not combined rationally for aggregate statistics on multiple table scans in an arm. For example, two partitions may each have an int column with only 30 distinct values over 2M rows per partition. The planner will decide that the selectivity is 10% rather than the worst case (30+30)/(total rows) or the best case 30/(total rows). There are many variations on this theme, not just the above type of error.
* An empty partition (usually the parent table) will propagate default statistics through the errors above leading to even more strange query plans. Empty table default statistics are generally poor, but worse in the cases above since they override good stats. Furthermore, ANALYZE won't write stats for an empty table so unless you put dummy rows in a table, you are guaranteed to get bad plans.

I have put a single dummy row in each parent table with minimum column width on text columns and analyzed it to help out on the above. All partitions are fully vacuumed with no bloat and fully analyzed. After a day, they become read only.
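Concretely, the dummy-row trick described above amounts to something like the following (the column list is a guess at the real schema; the point is just that ANALYZE will not write stats for a truly empty table):

```sql
-- Seed the otherwise-empty parent with one minimal-width row so that
-- ANALYZE records statistics for it at all.
INSERT INTO s_logs (s_id, date, session, terms)
VALUES (0, '2000-01-01', '', '');

ANALYZE s_logs;
```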

The only time I wish I had hints is when I'm working on partitioned tables; most other times, minor query modifications plus good statistics get a good plan (if sometimes a less natural query syntax). The planner is simply not adequate on partitioned tables.
