Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <maillist(at)candle(dot)pha(dot)pa(dot)us>
Cc: Zeugswetter Andreas IZ5 <Andreas(dot)Zeugswetter(at)telecom(dot)at>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)
Date: 1999-07-28 23:44:21
Message-ID: 25713.933205461@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Bruce Momjian <maillist(at)candle(dot)pha(dot)pa(dot)us> writes:
>> BTW, this argument proves rigorously that the selectivity of a search
>> for any value other than the MFOV is not more than 0.5, so there is some
>> basis for my intuition that eqsel should not return a value above 0.5.
>> So, in the cases where eqsel does not know the exact value being
>> searched for, I'd still be inclined to cap its result at 0.5.

> I don't follow this. If the most frequent value occurs 95% of the time,
> wouldn't the selectivity be 0.95?

If you are searching for the most frequent value, then the selectivity
estimate should indeed be 0.95. If you are searching for anything else,
the selectivity estimate ought to be 0.05 or less. If you don't know
what value you will be searching for, which number should you use?

The unsupported assumption here is that if the table contains 95%
occurrence of a particular value, then the odds are also 95% (or at
least high) that that's the value you are searching for in any given
query that has an "= something" WHERE qual.

That assumption is pretty reasonable in some cases (such as your
example earlier of "WHERE state = 'PA'" in a Pennsylvania-local
database), but it falls down badly in others, such as where the
most common value is NULL or an empty string or some other indication
that there's no useful data. In that sort of situation it's actually
pretty unlikely that the user will be searching for field =
most-common-value ... but the system probably has no way to know that.

I wonder whether it would help to add even more data to pg_statistic.
For example, suppose we store the fraction of the columns that are NULL,
plus the most frequently occurring *non null* value, plus the fraction
of the columns that are that value. This would allow us to be very
smart about columns in which "no data" is represented by NULL (as a good
DB designer would do):

selectivity of "IS NULL": NULLfraction

selectivity of "IS NOT NULL": 1 - NULLfraction

selectivity of "= X" for a known non-null constant X:
if X == MFOV: MFOVfraction
else: MIN(MFOVfraction, 1-MFOVfraction-NULLfraction)

selectivity of "= X" when X is not known a priori, but presumably is not
null:
MIN(MFOVfraction, 1-NULLfraction)

Both of the MIN()s are upper bounds, so multiplying them by a
fudge-factor < 1 would be reasonable.

These rules would guarantee small selectivity values when either
MFOVfraction or 1-NULLfraction is small. It still wouldn't cost
much, since I believe VACUUM ANALYZE is counting nulls already...

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 1999-07-29 00:07:34 Re: [HACKERS] pg_dump not dumping all tables
Previous Message Tom Lane 1999-07-28 23:16:59 Re: [HACKERS] pg_dump not dumping all tables