On Tue, Sep 15, 2009 at 11:35 PM, Kenaniah Cerny <kenaniah(at)gmail(dot)com> wrote:
> The following bug has been logged online:
> Bug reference: 5059
> Logged by: Kenaniah Cerny
> Email address: kenaniah(at)gmail(dot)com
> PostgreSQL version: 8.4.1
> Operating system: Centos5.2
> Description: Planner ignores estimates when planning an IN ()
> Consider the following query:
> When planning the subquery of the IN () statement, the planner chose to scan
> the indexes of the outer and inner columns in parallel using a nested loop
> semi join.
> By not enabling the planner to sort via the index of the outer column in the
> WHERE clause (query above), the a nested loop version of the plan executes
> in a fraction of the time.
> As you can see from the above query, forcing the materialization of the
> subquery produces a much superior plan.
> For comparison, this query replaces the subquery with hard-coded values.
> The planner appears to not be weighing the benefits of materializing the
> subquery of the IN () statement properly when ordering is involved, and
> still produces an inferior plan when ordering is not a factor.
> Please feel free to contact me for additional test cases if needed.
I've seen this kind of plan before. The planner knows that a
backwards index scan over user_activity_log is slow, but it thinks
that it won't have to scan very much of it because it will accumulate
enough rows to satisfy the LIMIT before it goes very far. When it
turns out that there are not as many matching rows in user_friends as
anticipated, things get very slow.
Unfortunately, there's not an easy solution to this problem. If it
were really true that most rows in user_activity_log had matches in
user_friends, then planner would be correct to choose the plan it did
- and wrong to choose the plan you want it to use. In this case, your
table is small enough that a bitmap heap scan on user_activity_log is
reasonable, but for a larger table in a situation where the join
selectivity is higher, it might really be necessary to use the index.
The problem is just that the planner is doing a poor job figuring out
where the breakpoint is between those two strategies.
I'm not 100% sure exactly which selectivity estimate is wrong. It's
not clear to me whether things are already wrong here:
INDEX Scan USING user_friends_user_account_id_friend_id_key ON
user_friends (cost=0.00..0.27 rows=1 width=4) (actual
time=0.004..0.004 rows=0 loops=350713)
But they're definitely wrong here:
Nested Loop Semi JOIN (cost=0.00..144564.33 rows=62298 width=52)
(actual time=5138.961..5405.075 rows=10 loops=1)
I *think* the root cause of the problem is that we don't have
multi-column statistics. When you look at user_friends, we can tell
you the selectivity of user_friends.user_account_id = 74650 pretty
accurately, and we should also be able to spit out MCVs to compare
against the MCVs of user_activity_log, but if the selectivity of those
two expressions together doesn't resemble the product of their
individual selectivities, which is probably the case here, the planner
has no way to detect that.
In response to
pgsql-bugs by date
|Next:||From: Kenaniah Cerny||Date: 2009-09-16 22:39:37|
|Subject: Re: BUG #5059: Planner ignores estimates when planning an IN () subquery|
|Previous:||From: Merlin Moncure||Date: 2009-09-16 20:02:02|
|Subject: 'missing parameter $1' for sql or pl/pgsql COPY|