Re: Make the qual cost on index Filter slightly higher than qual cost on index Cond.

From: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)2ndquadrant(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Make the qual cost on index Filter slightly higher than qual cost on index Cond.
Date: 2020-05-27 13:58:04
Message-ID: CAKU4AWrkL+1brQ1Sd-GotV=8=gYaCNARurPkJek76m9sPwJOGQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 27, 2020 at 8:01 PM Ashutosh Bapat <
ashutosh(dot)bapat(at)2ndquadrant(dot)com> wrote:

>
>
> On Wed, 27 May 2020 at 04:43, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> wrote:
>
>> You can use the attached SQL to reproduce this issue, but I'm not sure
>> you will get the above result on the first try. That's because when the
>> optimizer thinks the 2 index scans have the same cost, it chooses the
>> first one it finds, and that order depends on RelationGetIndexList. If
>> so, you may try dropping and recreating the j1_i_im5 index.
>>
>> The idea behind this patch is that we still use the cost-based
>> optimizer; it's just that when we find the 2 index scans have the same
>> cost, we prefer the index which has more quals in the Index Cond. This
>> is implemented by making the qual cost on the index filter slightly
>> higher.
>>
>
> Thanks for the example and the explanation.
>
> The execution time difference in your example is pretty high to account
> for executing the filter on so many rows. My guess is this has to do with
> the heap access: to apply the filter, the entire row needs to be fetched
> from the heap. So we should investigate this case from that angle. Another
> guess I have is that the statistics are not correct and hence the cost is
> wrong.
>
>
I believe this is a statistics issue, which then makes the cost wrong. More
characteristics of this issue are:

1). If a value is out of the range covered by the old statistics, the
optimizer will assume 1 row.
2). Based on that 1-row assumption, for the query "col1 = out_of_range_val
AND col2 = any_value", indexes (col1, col2) and (col1, col3) have exactly
the same cost under the current cost model.
3). If the statistics are wrong, (col1, col3) may yield a very bad plan as
shown above, but index (col1, col2) should always be better or no worse
than (col1, col3) in any case.
4). Extending the rule: for the query "col1 = out_of_range_val AND col2 =
any_value AND col3 = any_val" and indexes (col1, col2, col_m) and (col1,
col_m, col_n), the former index will always be better or no worse than
the latter one.
5). A statistics issue like this is not uncommon; for example, in a
log-based application, creation_date very easily goes beyond the range
covered by the statistics.
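
As a minimal sketch of points 1) and 2) (all table, column, and index
names below are hypothetical, not the ones from the attached SQL; the
exact row estimate depends on how the MCV list comes out):

-- statistics will only know about col1 values in [0, 99]
CREATE TABLE t (col1 int, col2 int, col3 int);
INSERT INTO t SELECT i % 100, i, i FROM generate_series(1, 100000) i;
CREATE INDEX t_c1_c2 ON t (col1, col2);
CREATE INDEX t_c1_c3 ON t (col1, col3);
ANALYZE t;

-- data loaded after ANALYZE is invisible to the statistics
INSERT INTO t SELECT 1000, i, i FROM generate_series(1, 50000) i;

-- col1 = 1000 is out of range in the old statistics, so the planner
-- estimates about 1 row, and the two indexes get the same (or nearly
-- the same) cost, even though t_c1_c2 can put col2 = 1 into the Index
-- Cond while t_c1_c3 can only apply it as a Filter after the heap fetch.
EXPLAIN SELECT * FROM t WHERE col1 = 1000 AND col2 = 1;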

So we need to optimize the cost model for such cases; the method is the
patch I mentioned above. I don't have solid data to prove that Oracle does
something similar, but based on the conversations with my customer, Oracle
likely does something like this.
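
To see both plan shapes regardless of which index the tie-break picks
(per the drop/recreate note quoted above), one can hide each index in
turn inside a transaction; a sketch reusing the hypothetical names from
the example above:

BEGIN;
DROP INDEX t_c1_c3;
EXPLAIN SELECT * FROM t WHERE col1 = 1000 AND col2 = 1;
-- Index Cond: ((col1 = 1000) AND (col2 = 1))  -- both quals in the index
ROLLBACK;

BEGIN;
DROP INDEX t_c1_c2;
EXPLAIN SELECT * FROM t WHERE col1 = 1000 AND col2 = 1;
-- Index Cond: (col1 = 1000)
-- Filter: (col2 = 1)   -- checked only after fetching each heap row
ROLLBACK;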

--
Best Regards
Andy Fan
