Re: proposal : cross-column stats

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal : cross-column stats
Date: 2010-12-12 14:43:00
Message-ID: 4D04DF74.1010706@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 12.12.2010 15:17, Martijn van Oosterhout wrote:
> On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
> Very cool that you're working on this.

+1

>> Lets talk about one special case - I'll explain how the proposed
>> solution works, and then I'll explain how to make it more general, what
>> improvements are possible, what issues are there. Anyway this is by no
>> means a perfect or complete solution - it's just a starting point.
>
> It looks like you handled most of the issues. Just a few points:
>
> - This is obviously applicable to more than just integers, probably
> anything with a b-tree operator class. What you've coded seems rely
> on calculations on the values. Have you thought about how it could
> work for, for example, strings?
>
> The classic failure case has always been: postcodes and city names.
> Strongly correlated, but in a way that the computer can't easily see.

Yeah, and that's actually analogous to the example I used in my
presentation.

The way I think of that problem is that once you know the postcode,
knowing the city name doesn't add any information. The postcode implies
the city name. So the selectivity for "postcode = ? AND city = ?" should
be the selectivity of "postcode = ?" alone. The measurement we need is
"implicativeness": How strongly does column A imply a certain value for
column B. Perhaps that could be measured by counting the number of
distinct values of column B for each value of column A, or something
like that. I don't know what the statisticians call that property, or if
there's some existing theory on how to measure that from a sample.

That's assuming the combination has any matches. It's possible that the
user chooses a postcode and city combination that doesn't exist, but
that's no different from a user doing "city = 'fsdfsdfsd'" on a single
column, returning no matches. We should assume that the combination
makes sense.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2010-12-12 15:43:16 Re: function attributes
Previous Message Oleg Bartunov 2010-12-12 14:32:01 Re: Extensions, patch v16