Re: Correlation in cost_index()

From: Sean Chittenden <sean(at)chittenden(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Manfred Koizar <mkoi-pg(at)aon(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Correlation in cost_index()
Date: 2003-08-08 22:10:06
Message-ID: 20030808221006.GQ94710@perrin.int.nxad.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> > Which suggests to me that line 3964 in
> > ./src/backend/utils/adt/selfuncs.c isn't right for multi-column
> > indexes, esp for indexes that are clustered. I don't know how to
> > address this though... Tom, any hints?
>
> Yes, we knew that already. Oliver had suggested simply dropping the
> division by nKeys, thus pretending that the first-column correlation
> is close enough. That seems to me to be going too far in the other
> direction,

But is it really?

> xbut clearly dividing by nKeys is far too pessimistic. I'd change
> this in a moment if someone could point me to a formula with any
> basis at all ...

Got it, alright. I'd never paid attention to prior discussions as the
planner had generally did the right thing (with a lowered
random_page_cost ::grin::). In terms of statistics and setting
indexCorrelation correctly, something like Spearman's rho calculation
comes to mind, though I don't know how applicable that is to database
theory.

indexCorrelation is 1.0 for the 1st key in a multi-column index. The
only thing different about a multi-column index and a single column
index is the multi-column index takes up more space per key, resulting
in fewer index entries per page and more pages being fetched than
would be in a single column index, but the current cost_index()
function takes increased number of page fetches into account when
calculating cost. As things stand, however, if a multi-column key is
used, the indexCorrelation is penalized by the size of the number of
keys found in the multi-column index. As things stand the qual
user_id = 42, on a CLUSTER'ed multi-column index (user_id,utc_date)
has an indexCorrelation of 0.5, when in fact the correlation is 1.0.
indexCorrelation == number of random page fetches, which could be next
to free on a solid state drive, in this case, the page fetches aren't
random, they're perfectly sequential. If it were 'user_id = 42 AND
utc_date = NOW()', the correlation of a lookup of the user_id would
still be 1.0 and the utc_date would be 1.0 because both values are
looked up in the index key. A lookup of just the utc_date can never
use the index and the planner correctly uses a sequential scan. Cost
!= Correlation. They're proportional, but not the same and
indexCorrelation is the wrong place to handle cost as that's done by
the Mackert and Lohman formula. Under what circumstances would it be
correct to pessimize the use of indexCorrelation? An indexCorrelation
of 0.0 doesn't mean that the index is useless either, just that we
take the full hit of a completely random page read as opposed to some
fraction of a random page cost.

I tossed a different index on my test table to see how well things
fare with a low correlation, and this was a bit disturbing:

# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE rucc.html_bytes > 8000000::BIGINT;
INFO: cost_seqscan: run_cost: 21472.687500
startup_cost: 0.000000

INFO: cost_index: run_cost: 112165.065458
startup_cost: 0.000000
indexCorrelation: 0.183380
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Seq Scan on report_user_cat_count rucc (cost=0.00..21472.69 rows=31893 width=64) (actual time=444.25..2489.27 rows=514 loops=1)
Filter: (html_bytes > 8000000::bigint)
Total runtime: 2492.36 msec
(3 rows)

# SET enable_seqscan = false;
SET
# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE rucc.html_bytes > 8000000::BIGINT;
INFO: cost_seqscan: run_cost: 21472.687500
startup_cost: 100000000.000000

INFO: cost_index: run_cost: 112165.065458
startup_cost: 0.000000
indexCorrelation: 0.183380
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using report_user_cat_count_html_bytes_idx on report_user_cat_count rucc (cost=0.00..112165.07 rows=31893 width=64) (actual time=68.64..85.75 rows=514 loops=1)
Index Cond: (html_bytes > 8000000::bigint)
Total runtime: 97.75 msec
(3 rows)

*shrug* A low indexCorrelation overly pessimizes the cost of an index,
but I'm not sure where to attribute this too. :-/

-sc

--
Sean Chittenden

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2003-08-08 22:25:41 Re: Correlation in cost_index()
Previous Message Tom Lane 2003-08-08 22:03:17 Re: [HACKERS] IS OF