> Robert explained why having more MCVs might be useful because we use
> the frequency of the least common MCV as an upper bound on the
> frequency of any value in the MCV.
Where is that being used? The only non-MCV frequency estimate that I
recall seeing is ( nrows - n_ndistinct_rows )/ndistinct. Obviously
changing the number of mcv's affects this by lowering
n_ndistinct_rows, but it's always pretty coarse estimate.
> Binding the length of the MCV list to the size of the histogram is
> arbitrary but so would any other value
Wouldn't the best approach be to stop adding MCV's/histogram buckets
when adding new ones doesn't decrease your prediction error
One very hacky threshold heuristic is to stop adding MCV's when a
simple equality select ( SELECT col FROM table WHERE col == VALUE )
would switch the plan from an index to a sequential scan ( or vice
versa, although with the current code this would never happen ). ie,
if the non_mcv frequency estimate is 0.1% ( producing an index scan ),
but adding the MCV gives us an estimate of 5% ( pbly producing a seq
scan ) then add that value as an mcv. More sophisticated variations
might also consider plan changes to very suboptimal joins; even more
sophisticated would be to stop when the MAX( curr - optimal plan /
optimal plan ) was below some threshold, say 20%, over a bunch of
recently executed queries.
A similar approach would work for histogram bins, except the queries
of interest are inequality rather than equality selections.
In response to
pgsql-hackers by date
|Next:||From: Robert Haas||Date: 2010-10-21 01:54:19|
|Subject: Re: max_wal_senders must die|
|Previous:||From: Robert Haas||Date: 2010-10-21 01:49:23|
|Subject: Re: default_statistics_target WAS: max_wal_senders must die|