## Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)

From: Tom Lane Zeugswetter Andreas IZ5 pgsql-hackers(at)postgreSQL(dot)org Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect) 1999-07-28 14:57:00 9873.933173820@sss.pgh.pa.us (view raw, whole thread or download thread mbox) 1999-07-28 14:57:00 from Tom Lane  1999-07-28 16:06:22 from Bruce Momjian pgsql-hackers
Zeugswetter Andreas IZ5 <Andreas(dot)Zeugswetter(at)telecom(dot)at> writes:
> Other db's usually use the value count(*) / nunique for the light
> weight statistics.  This makes the assumptoin that the distinct index
> values are evenly distributed.  That is on average a correct
> assumption, whereas our assumption on average overestimates the number
> of rows returned.  I am not sure we have a nunique info though.

We don't, and AFAICS it would be an expensive statistic to compute.

with what I think is a better idea.  Suppose that VACUUM ANALYZE stores
in pg_statistic not only the disbursion, but also the most frequently
occurring value of each column.  It already computes (or I should say
estimates) the most frequently occurring value (MFOV) in order to arrive
at the disbursion, so storing the value costs nothing except a little
more space in pg_statistic.  Now, the logic that eqsel() should use is

if constant-being-compared-against == MFOV then
return disbursion;
else
return MIN(disbursion, 1.0 - disbursion);

which works like this: if we are indeed looking for the MFOV then the
selectivity is just the disbursion, no question.  If we are looking for
a value *other* than the MFOV, then the selectivity must be less than
the disbursion, since surely this value occurs less often than the MFOV.
But the total fraction of non-MFOV values in the table is
1.0-disbursion, so the fraction that are the specific value we want
can't exceed that either.

The MIN() above is therefore a hard upper bound for the selectivity
of the non-MFOV case.  In practice we might want to multiply the MIN
by a fudge-factor somewhat less than one, to arrive at what we hope
is a reasonable estimate rather than a worst-case estimate.

BTW, this argument proves rigorously that the selectivity of a search
for any value other than the MFOV is not more than 0.5, so there is some
basis for my intuition that eqsel should not return a value above 0.5.
So, in the cases where eqsel does not know the exact value being
searched for, I'd still be inclined to cap its result at 0.5.

If we use this logic, the stat we really want is exactly the frequency
of the MFOV, not the disbursion which is just closely related to it.
I have not looked at the other uses of disbursion, but if they all can
work like this we might want to forget the statistical niceties and just
store the frequency of the MFOV.

A final comment is that NULL would be treated just like any regular
value in determining what is the MFOV...

regards, tom lane

### pgsql-hackers by date

 Next: From: Bernard Frankpitt Date: 1999-07-28 15:02:41 Subject: Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple select) Previous: From: Thomas Lockhart Date: 1999-07-28 14:51:17 Subject: Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)