Re: Optimization idea

From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Vlad Arkhipov <arhipov(at)dc(dot)baikal(dot)ru>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Optimization idea
Date: 2010-04-23 19:22:08
Message-ID: i2ne94e14cd1004231222x5459b076w9d0d9077ad73a57a@mail.gmail.com
Lists: pgsql-performance

2010/4/23 Robert Haas <robertmhaas(at)gmail(dot)com>:
> On Fri, Apr 23, 2010 at 9:09 AM, Cédric Villemain
> <cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
>> 2010/4/23 Robert Haas <robertmhaas(at)gmail(dot)com>:
>>> On Thu, Apr 22, 2010 at 10:37 PM, Vlad Arkhipov <arhipov(at)dc(dot)baikal(dot)ru> wrote:
>>>> I don't think this is just an issue with statistics, because the same
>>>> problem arises when I try executing a query like this:
>>>
>>> I'm not sure how you think this proves that it isn't a problem with
>>> statistics, but I think what you should be focusing on here, looking
>>> back to your original email, is that the plans that are actually much
>>> faster have almost as much estimated cost as the slower one.  Since
>>> all your data is probably fully cached, at a first cut, I might try
>>> setting random_page_cost and seq_page_cost to 0.005 or so, and
>>> adjusting effective_cache_size to something appropriate.
>>
>> that will help correct the situation, but the planner is losing here I think.
>
> Well, what do you think the planner should do differently?

Here the planner just divides the number of rows in the t2 table by the
number of distinct values of t1.t; that is the rows=20200 we can see in
the explains.
This seems normal, but it also looks to me like it can be improved.
When it estimates the row count as just num_rows/n_distinct, the planner
*knows* that this is wrong, because the most_common_freqs of t2.t say
that 99600 of the rows have the value 1 and that every other value has
fewer than 200 rows.
So in every case the planner makes a (perhaps good) choice while being
sure its estimate is wrong.
I wonder if we can improve the planner here.
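
For reference, the statistics the planner works from here are visible
in the pg_stats view (the table and column names below are just the
ones from the test case):

-- n_distinct drives the generic num_rows/n_distinct estimate;
-- most_common_vals/most_common_freqs show that one value dominates.
select n_distinct, most_common_vals, most_common_freqs
from pg_stats
where tablename = 't2' and attname = 't';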

In this case, where the number of rows in t1.t is lower than the stats
target, perhaps the planner could improve its decision by going a bit
further and trying a plan for each of the n_distinct values and the
corresponding statistics in t2.t.

I don't have a very clear idea of how to do that, but it might be
better if the planner estimated whether its plan is 100% (or lower,
just an idea) sure to happen: if it is, fine; otherwise, try another
plan.

In this test case, the query is:
select *
from t2
join t1 on t1.t = t2.t
where t1.id = X;

If X=1, then the planner has a 20% chance that the rowcount is 99600
and an 80% chance that it is 200 or less; by providing a rowcount of
20200, how can it find the good plan anyway? Is it better to start with
a bad estimate and perhaps find a good plan, or to start with an
estimate which may be bad but leads to a good plan in more than XX% of
the cases?
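
To make the contrast concrete, a rough sketch (names and row counts are
the ones from the test case, and I assume 2 is one of the rare values
of t): with the join the planner can only use the generic estimate,
while with the value written as a literal it can use the
most_common_freqs entry for that value:

-- generic estimate: rows ~ num_rows/n_distinct, the 20200 above
explain select * from t2 join t1 on t1.t = t2.t where t1.id = X;

-- per-value estimates, if the literal were known at plan time:
explain select * from t2 where t2.t = 1;  -- the common value, ~99600 rows
explain select * from t2 where t2.t = 2;  -- a rare value, 200 rows or less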

So, currently, the planner does as expected, but can we try another
approach for those corner cases?

>
> ...Robert
>

--
Cédric Villemain
