Re: Correlation in cost_index()

From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Sean Chittenden <sean(at)chittenden(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Correlation in cost_index()
Date: 2003-08-08 23:18:55
Message-ID: f298jvkfqeejig0ed08rpti9nf0754o3eh@4ax.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 8 Aug 2003 15:10:06 -0700, Sean Chittenden
<sean(at)chittenden(dot)org> wrote:
>> Yes, we knew that already. Oliver had suggested simply dropping the
>> division by nKeys, thus pretending that the first-column correlation
>> is close enough. That seems to me to be going too far in the other
>> direction,
>
>But is it really?

Yes. I once posted two use cases showing that one side is as wrong as
the other. Wait a momemt ... Here it is:
http://archives.postgresql.org/pgsql-hackers/2002-10/msg00229.php.

>indexCorrelation is 1.0 for the 1st key in a multi-column index. [...]

Unfortunately there are cases, where the correlation of the first
column doesn't tell us anything about the real index correlation.

> An indexCorrelation
>of 0.0 doesn't mean that the index is useless either, just that we
>take the full hit of a completely random page read as opposed to some
>fraction of a random page cost.

... and that we cannot expect the tuples we are looking for to be on
the same page or on pages near to each other. So we have to read a
higher number of pages.

> indexCorrelation: 0.183380
> Seq Scan on report_user_cat_count rucc (cost=0.00..21472.69 rows=31893 width=64) (actual time=444.25..2489.27 rows=514 loops=1)
>
> Index Scan using report_user_cat_count_html_bytes_idx on report_user_cat_count rucc (cost=0.00..112165.07 rows=31893 width=64) (actual time=68.64..85.75 rows=514 loops=1)

The main problem here is the bad estimation for number of rows: 31893
vs. 514. The seq scan cost depends on the size of the heap relation
and is always the same (~ 21000) for different search conditions.
With low correlation, index scan cost is roughly proportional to the
number of tuples fetched (ignoring the effects of effective_cache_size
for now). So for a correct guess of number of rows we'd get

seq idx
estimated 21000 1800
actual 2500 86
ratio 8.4 20.4

As estimated and actual are not given in the same units, we look at
the estimated/actual ratio and see that index scan cost is
over-estimated by a factor 2.5. index_cost_algorithm 4 (geometric
interpolation) should get this right. Can you ANALYSE with a higher
statistics_target and then test with different index_cost_algorithms?

Servus
Manfred

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Manfred Koizar 2003-08-08 23:48:52 Re: Correlation in cost_index()
Previous Message Joe Conway 2003-08-08 22:45:32 Re: [HACKERS] IS OF