Re: proposal : cross-column stats

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal : cross-column stats
Date: 2010-12-12 16:23:20
Message-ID: 4D04F6F8.1040403@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dne 12.12.2010 15:43, Heikki Linnakangas napsal(a):
>> The classic failure case has always been: postcodes and city names.
>> Strongly correlated, but in a way that the computer can't easily see.
>
> Yeah, and that's actually analogous to the example I used in my
> presentation.
>
> The way I think of that problem is that once you know the postcode,
> knowing the city name doesn't add any information. The postcode implies
> the city name. So the selectivity for "postcode = ? AND city = ?" should
> be the selectivity of "postcode = ?" alone. The measurement we need is
> "implicativeness": How strongly does column A imply a certain value for
> column B. Perhaps that could be measured by counting the number of
> distinct values of column B for each value of column A, or something
> like that. I don't know what the statisticians call that property, or if
> there's some existing theory on how to measure that from a sample.

Yes, those issues are a righteous punishment for breaking BCNF rules ;-)

I'm not sure it's solvable using the contingency tables, as it requires
knowledge about dependencies between individual values (working with
cells is not enough, although it might improve the estimates).

Well, maybe we could collect these stats (number of cities for a given
ZIP code and number of ZIP codes for a given city). Collecting a good
stats about this is a bit tricky, but possible. What about collecting
this for the MCVs from both columns?

Tomas

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2010-12-12 16:24:02 Re: function attributes
Previous Message Tom Lane 2010-12-12 16:16:46 Re: [COMMITTERS] pgsql: Use symbolic names not octal constants for file permission flags