Re: Cross-field statistics

From: Decibel! <decibel(at)decibel(dot)org>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cross-field statistics
Date: 2008-04-18 00:28:51
Message-ID: F03A6C3F-A325-4BB7-AEA0-784DFE2F5F54@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Apr 17, 2008, at 12:22 PM, Gregory Stark wrote:
> "Decibel!" <decibel(at)decibel(dot)org> writes:
>
>> For each field that isn't already in a set of field groupings
>> * Sort sample rows on that field
>> * Calculate correlation for all other fields
>> * If there are other fields that have a correlation to this sort
>> order over
>> some threshold, save them along with the field we originally
>> sorted on as a
>> new 'field grouping'
>> * Else, there are no other fields that group with this field;
>> it's a "loner"
>
> I think this is going somewhere. But "correlation" isn't quite
> right. It has
> the same problem our use of correlation for clusteredness has.
> Consider the
> case of Zip code and City. They're nearly very non-independent
> variables but
> there's basically no correlation.

If we have a limited number of values in one of the fields, it would
be possible to build a histogram of values for other fields based on
the values in the first field. But I can't see how we could possibly
represent something like city, zip in a compact form. You would have
to keep a range of zips that cover a city.

Hmm... but we only care about cities that have a substantial number
of zip codes. This is what the equivalent of the most-common-values
list would be for cross-platform stats: for the most_common_vals in
column a, you store a range or histogram of the corresponding values
in b, assuming that there is a good correspondence.

>> For each field grouping, at a minimum we'd need to store a
>> histogram for that
>> grouping.
>
> This is a problem. What does a histogram on a grouping mean? It's
> not clear
> how to come up with a histogram which can help answer questions like
> A between ? and ? and B between ? and ?
>
> You can do a histogram on <a,b> or <b,a> but neither are going to be
> especially useful. Heikki and I came up with a weird hybrid thing
> which might
> be useful for avoiding overestimating selectivity like
> WHERE city='BOS' AND areacode = '617'
>
> But it didn't help at all with the converse, ie:
> WHERE city='BOS' AND areacode = '212'
>
> It's hard to see how we could possibly catch cases like that though.

If the two fields share the same correlation, then the histogram is
just what we use right now. We could actually do this today, but only
for fields with a high physical correlation. What I was describing
allowed extending this to fields that have a high correlation to each
other, even if they didn't have a high physical correlation. I know
that this doesn't help us for things like city/area code or city/zip,
but other than my idea above I'm rather at a loss on how to represent
that in a compact fashion.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Gregory Stark 2008-04-18 00:42:26 Re: Proposed patch - psql wraps at window width
Previous Message Aidan Van Dyk 2008-04-18 00:26:58 Re: Lessons from commit fest