Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)

From: Philip Warner <pjw(at)rhyme(dot)com(dot)au>
To: Bruce Momjian <maillist(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Zeugswetter Andreas IZ5 <Andreas(dot)Zeugswetter(at)telecom(dot)at>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)
Date: 1999-07-29 02:37:05
Message-ID: 3.0.5.32.19990729123705.00b59380@mail.rhyme.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At 20:21 28/07/99 -0400, you wrote:
>
>> I wonder whether it would help to add even more data to pg_statistic.
>> For example, suppose we store the fraction of the columns that are NULL,
>> plus the most frequently occurring *non null* value, plus the fraction
>> of the columns that are that value. This would allow us to be very
>> smart about columns in which "no data" is represented by NULL (as a good
>> DB designer would do):
>
>That would be nice.
>

I know I've mentioned this before, but can't the designer of the query be
given some influence over optimizer index choices? We can circle around the
problem of understanding the demographics of a table, but without
row-by-row analysis, you'll *never* get the complete and accurate view that
is needed to cater for all cases.

OTOH, a query designer often knows that a particular query will only be run
to find 'exceptions' (ie. non-nulls when 95% are nulls), or to find 'small'
ranges. IMO, when a DBA is in a position to help the optimizer, they
*should* allowed to. PG *already* has something like this in the form of
partial indexes: you can view the query that is associated with the index
as a 'hint' as to when that index should be used. All I'm asking is for
queries, not indexes, to specify when an index is used.

This will not in any way replace the optimizer, but it will give users the
ability deal with pathological cases.

In terms of the statistics collected, it *may* also be worth doing some
rudimentary analysis on the data to see it is conforms to any common
distribution (or sum of distributions), and if it does, save that
information. eg. the optimizer will do pretty well if it *knows* the data
is in a normal distribution, with a mean of 972 and a stdev of 70! Of
course, you must be sure that it *is* a normal distribution to start with.

FWIW, statisticians often seem worried about three values: the mean, median
and mode. I don't really know which is which, but they are:

o The average of all values
o The average of the min and max value
o The most common value.

Someone who knows a lot more about this stuff than me can probably tell us
how these values will affect the trust we place in the index statistics.
Someone on this list must be able to give us some insight???

----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: +61-03-5367 7422 | _________ \
Fax: +61-03-5367 7430 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 1999-07-29 02:39:03 Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple se lect)
Previous Message The Hermit Hacker 1999-07-29 02:28:17 Re: [HACKERS] pg_dump not dumping all tables