Re: BUG #5059: Planner ignores estimates when planning an IN () subquery

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kenaniah Cerny <kenaniah(at)gmail(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #5059: Planner ignores estimates when planning an IN () subquery
Date: 2009-09-17 02:35:28
Message-ID: 603c8f070909161935t68266cdt2cf1c10803b8ccd7@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Wed, Sep 16, 2009 at 6:39 PM, Kenaniah Cerny <kenaniah(at)gmail(dot)com> wrote:
> I can provide the output of statistics queries if you would like. Just let
> me know which statistics you want and I'll pastebin them.
>
> As far as selectivity goes, the selectivity estimate for the
> user_anime_log.user_account_id was definitely miscalculated. The
> user_anime_log contains up to 15 entries per user (70,000 users, but only
> 475,811 rows). The default statistics target on that relation is set to
> 1000. But even with poor statistics, guessing 62,000 rows when there's a
> maximum of 15 per user account is still quite a miss.

You've changed the table name on me, vs. what you pasted in the query,
which had a user_activity_log but no user_anime_log...

> Analysis of SELECT * FROM user_activity_log WHERE user_account_id = 17238;
> estimates 13 rows and returns 15, which is quite reasonable considering the
> statistics targets.
>
> Please forgive my ignorance, but in the case of my subquery, is the
> estimated number of rows and cost being taken into account (or only the
> selectivity)?

Well, selectivity is just a term that refers to the fraction of rows
that match some condition (rows themselves do not have selectivity).
Usually the initial estimating is done in terms of selectivity, which
is then multiplied by the total number of rows to find the number of
rows that will remain after the condition is applied.

So, yes, rows and cost are taken into account. The problem here is
that the planner is mis-estimating the selectivity, therefore it
computes the wrong number of rows (way too high), therefore it makes
the wrong decision.

> Granted I don't understand much about the planner internals,
> but it seems strange that a nested loop semi join would be chosen when the
> inner table is estimated to return an extremely low number of rows with low
> cost and low selectivity. Shouldn't the planner also estimate the cost of an
> inner (er, left?) join in that scenario?

Well... you can't replace a semi join with an inner or left join,
because it doesn't do the same thing. You could use a hash semi join
or merge join semi join, but that doesn't make sense if, as you say,
the inner table is estimated to return an extremely low number of
rows.

It might be a bit easier to analyze this if you stripped out all the
joins that aren't necessary to reproduce the problem. Also, I would
prefer EXPLAIN ANALYZE output posted in-line to the mailing list
rather than pasted to a separate web site - it screws up the
formatting.

But honestly I'm not sure how much time it's worth spending on this.
You have a way to rewrite the query that works... and fixing the
estimator is going to be hard... so I suggest doing it the way that
works, and moving on!

...Robert

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Craig Ringer 2009-09-17 03:15:18 Re: 답장: bug report
Previous Message Tom Lane 2009-09-17 02:31:15 Re: ERROR: invalid page header in block 502758 of relation base/16388/16403