Zeugswetter Andreas IZ5 <Andreas(dot)Zeugswetter(at)telecom(dot)at> writes:
> Other db's usually use the value count(*) / nunique for the light
> weight statistics. This makes the assumptoin that the distinct index
> values are evenly distributed. That is on average a correct
> assumption, whereas our assumption on average overestimates the number
> of rows returned. I am not sure we have a nunique info though.
We don't, and AFAICS it would be an expensive statistic to compute.
I have thought about this a little more overnight, and I have come up
with what I think is a better idea. Suppose that VACUUM ANALYZE stores
in pg_statistic not only the disbursion, but also the most frequently
occurring value of each column. It already computes (or I should say
estimates) the most frequently occurring value (MFOV) in order to arrive
at the disbursion, so storing the value costs nothing except a little
more space in pg_statistic. Now, the logic that eqsel() should use is
if constant-being-compared-against == MFOV then
return MIN(disbursion, 1.0 - disbursion);
which works like this: if we are indeed looking for the MFOV then the
selectivity is just the disbursion, no question. If we are looking for
a value *other* than the MFOV, then the selectivity must be less than
the disbursion, since surely this value occurs less often than the MFOV.
But the total fraction of non-MFOV values in the table is
1.0-disbursion, so the fraction that are the specific value we want
can't exceed that either.
The MIN() above is therefore a hard upper bound for the selectivity
of the non-MFOV case. In practice we might want to multiply the MIN
by a fudge-factor somewhat less than one, to arrive at what we hope
is a reasonable estimate rather than a worst-case estimate.
BTW, this argument proves rigorously that the selectivity of a search
for any value other than the MFOV is not more than 0.5, so there is some
basis for my intuition that eqsel should not return a value above 0.5.
So, in the cases where eqsel does not know the exact value being
searched for, I'd still be inclined to cap its result at 0.5.
If we use this logic, the stat we really want is exactly the frequency
of the MFOV, not the disbursion which is just closely related to it.
I have not looked at the other uses of disbursion, but if they all can
work like this we might want to forget the statistical niceties and just
store the frequency of the MFOV.
A final comment is that NULL would be treated just like any regular
value in determining what is the MFOV...
regards, tom lane
pgsql-hackers by date
|Next:||From: Bernard Frankpitt||Date: 1999-07-28 15:02:41|
|Subject: Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple select)|
|Previous:||From: Thomas Lockhart||Date: 1999-07-28 14:51:17|
|Subject: Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)|