Re: Cross-field statistics

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Decibel!" <decibel(at)decibel(dot)org>
Cc: "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cross-field statistics
Date: 2008-04-17 17:22:45
Message-ID: 878wzccs62.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

"Decibel!" <decibel(at)decibel(dot)org> writes:

> For each field that isn't already in a set of field groupings
> * Sort sample rows on that field
> * Calculate correlation for all other fields
> * If there are other fields that have a correlation to this sort order over
> some threshold, save them along with the field we originally sorted on as a
> new 'field grouping'
> * Else, there are no other fields that group with this field; it's a "loner"

I think this is going somewhere. But "correlation" isn't quite right. It has
the same problem our use of correlation for clusteredness has. Consider the
case of Zip code and City. They're nearly very non-independent variables but
there's basically no correlation.

If we found the right metric for clusteredness we could probably use it here
though too though.

> For each field grouping, at a minimum we'd need to store a histogram for that
> grouping.

This is a problem. What does a histogram on a grouping mean? It's not clear
how to come up with a histogram which can help answer questions like
A between ? and ? and B between ? and ?

You can do a histogram on <a,b> or <b,a> but neither are going to be
especially useful. Heikki and I came up with a weird hybrid thing which might
be useful for avoiding overestimating selectivity like
WHERE city='BOS' AND areacode = '617'

But it didn't help at all with the converse, ie:
WHERE city='BOS' AND areacode = '212'

It's hard to see how we could possibly catch cases like that though.

> The important thing is that this scheme adds less than O(n) (n being the
> number of fields), and not O(n^2), both in terms of ANALYZE

It looks like a good method for *finding* column sets which will be
interesting to keep more stats on. That's definitely one of the challenges.

I'm still not sure what stats to actually gather on the resulting column sets.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2008-04-17 17:25:34 Re: Lessons from commit fest
Previous Message Bruce Momjian 2008-04-17 17:13:04 Re: Lessons from commit fest