Re: BUG #5059: Planner ignores estimates when planning an IN () subquery

From: Kenaniah Cerny <kenaniah(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #5059: Planner ignores estimates when planning an IN () subquery
Date: 2009-09-16 22:39:37
Message-ID: f647f4600909161539n1eb071d6k2f56f484432ff632@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

I can provide the output of statistics queries if you would like. Just let
me know which statistics you want and I'll pastebin them.

As far as selectivity goes, the selectivity estimate for the
user_anime_log.user_account_id was definitely miscalculated. The
user_anime_log contains up to 15 entries per user (70,000 users, but only
475,811 rows). The default statistics target on that relation is set to
1000. But even with poor statistics, guessing 62,000 rows when there's a
maximum of 15 per user account is still quite a miss.

Analysis of SELECT * FROM user_activity_log WHERE user_account_id =
17238; estimates
13 rows and returns 15, which is quite reasonable considering the statistics
targets.

Please forgive my ignorance, but in the case of my subquery, is the
estimated number of rows and cost being taken into account (or only the
selectivity)? Granted I don't understand much about the planner internals,
but it seems strange that a nested loop semi join would be chosen when the
inner table is estimated to return an extremely low number of rows with low
cost and low selectivity. Shouldn't the planner also estimate the cost of an
inner (er, left?) join in that scenario?

Kenaniah

On Wed, Sep 16, 2009 at 2:32 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, Sep 15, 2009 at 11:35 PM, Kenaniah Cerny <kenaniah(at)gmail(dot)com>
> wrote:
> >
> > The following bug has been logged online:
> >
> > Bug reference: 5059
> > Logged by: Kenaniah Cerny
> > Email address: kenaniah(at)gmail(dot)com
> > PostgreSQL version: 8.4.1
> > Operating system: Centos5.2
> > Description: Planner ignores estimates when planning an IN ()
> > subquery
> > Details:
> >
> > Consider the following query:
> >
> > http://pgsql.privatepaste.com/aa5DAtiwws
> >
> > When planning the subquery of the IN () statement, the planner chose to
> scan
> > the indexes of the outer and inner columns in parallel using a nested
> loop
> > semi join.
> >
> > http://pgsql.privatepaste.com/4eXj3zRcy7
> >
> > By not enabling the planner to sort via the index of the outer column in
> the
> > WHERE clause (query above), the a nested loop version of the plan
> executes
> > in a fraction of the time.
> >
> > http://pgsql.privatepaste.com/5c0bOcL3t6
> >
> > As you can see from the above query, forcing the materialization of the
> > subquery produces a much superior plan.
> >
> > http://pgsql.privatepaste.com/371nl6KFrI
> >
> > For comparison, this query replaces the subquery with hard-coded values.
> >
> > The planner appears to not be weighing the benefits of materializing the
> > subquery of the IN () statement properly when ordering is involved, and
> > still produces an inferior plan when ordering is not a factor.
> >
> > Please feel free to contact me for additional test cases if needed.
>
> I've seen this kind of plan before. The planner knows that a
> backwards index scan over user_activity_log is slow, but it thinks
> that it won't have to scan very much of it because it will accumulate
> enough rows to satisfy the LIMIT before it goes very far. When it
> turns out that there are not as many matching rows in user_friends as
> anticipated, things get very slow.
>
> Unfortunately, there's not an easy solution to this problem. If it
> were really true that most rows in user_activity_log had matches in
> user_friends, then planner would be correct to choose the plan it did
> - and wrong to choose the plan you want it to use. In this case, your
> table is small enough that a bitmap heap scan on user_activity_log is
> reasonable, but for a larger table in a situation where the join
> selectivity is higher, it might really be necessary to use the index.
> The problem is just that the planner is doing a poor job figuring out
> where the breakpoint is between those two strategies.
>
> I'm not 100% sure exactly which selectivity estimate is wrong. It's
> not clear to me whether things are already wrong here:
>
> INDEX Scan USING user_friends_user_account_id_friend_id_key ON
> user_friends (cost=0.00..0.27 rows=1 width=4) (actual
> time=0.004..0.004 rows=0 loops=350713)
>
> But they're definitely wrong here:
>
> Nested Loop Semi JOIN (cost=0.00..144564.33 rows=62298 width=52)
> (actual time=5138.961..5405.075 rows=10 loops=1)
>
> I *think* the root cause of the problem is that we don't have
> multi-column statistics. When you look at user_friends, we can tell
> you the selectivity of user_friends.user_account_id = 74650 pretty
> accurately, and we should also be able to spit out MCVs to compare
> against the MCVs of user_activity_log, but if the selectivity of those
> two expressions together doesn't resemble the product of their
> individual selectivities, which is probably the case here, the planner
> has no way to detect that.
>
> ...Robert
>

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Craig Ringer 2009-09-17 00:52:21 Re: BUG #5060: 8.2 bin install
Previous Message Robert Haas 2009-09-16 21:32:39 Re: BUG #5059: Planner ignores estimates when planning an IN () subquery