Re: proposal : cross-column stats

From: Florian Pflug <fgp(at)phlo(dot)org>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal : cross-column stats
Date: 2010-12-12 16:33:24
Message-ID: 7DE5DDDD-1E3B-447A-AD43-D63BD0420FCE@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Dec12, 2010, at 15:43 , Heikki Linnakangas wrote:
> The way I think of that problem is that once you know the postcode, knowing the city name doesn't add any information. The postcode implies the city name. So the selectivity for "postcode = ? AND city = ?" should be the selectivity of "postcode = ?" alone. The measurement we need is "implicativeness": How strongly does column A imply a certain value for column B. Perhaps that could be measured by counting the number of distinct values of column B for each value of column A, or something like that. I don't know what the statisticians call that property, or if there's some existing theory on how to measure that from a sample.

The statistical term for this is "conditional probability", written P(A|B), meaning the probability of A under the assumption or knowledge of B. The basic tool for working with conditional probabilities is bayes' theorem which states that

P(A|B) = P(A and B) / P(B).

Currently, we assume that P(A|B) = P(A), meaning the probability (or selectivity as we call it) of an event (like a=3) does not change under additional assumptions like b=4. Bayes' theorem thus becomes

P(A) = P(A and B) / P(B) <=>
P(A and B) = P(A)*P(B)

which is how we currently compute the selectivity of a clause such as "WHERE a=3 AND b=4".

I believe that measuring this by counting the number of distinct values of column B for each A is basically the right idea. Maybe we could count the number of distinct values of "b" for every one of the most common values of "a", and compare that to the overall number of distinct values of "b"...

A (very) quick search on scholar.google.com for "estimate conditional probability" didn't turn up anything useful, but it's hard to believe that there isn't at least some literature on the subject.

best regards,
Florian Pflug

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2010-12-12 16:44:07 Re: function attributes
Previous Message Andrew Dunstan 2010-12-12 16:24:02 Re: function attributes