Re: Cross-column statistics revisited

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
Cc: josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cross-column statistics revisited
Date: 2008-10-17 06:41:40
Message-ID: 20081017064140.GB1443@svana.org
Lists: pgsql-hackers

On Fri, Oct 17, 2008 at 12:20:58AM +0200, Greg Stark wrote:
> Correlation is the wrong tool. In fact zip codes and city have nearly
> zero correlation. Zip codes near 00000 are no more likely to be in
> cities starting with A than Z.

I think we need to define our terms better. In terms of linear
correlation you are correct. However, you can define invertible mappings
from zip codes and cities onto the integers such that the mapped values
have an almost perfect correlation.
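
To illustrate, here is a rough sketch in Python on made-up data (not real
zip codes). The table sizes, the random seed and the names pearson, city_of
and new_zip are all assumptions chosen for the example. Each zip code
belongs to exactly one city, but the zip numbers are handed out at random,
so the raw linear correlation is close to zero. Renumbering the zip codes
with a one-to-one mapping that groups them by city makes the same data
almost perfectly correlated.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Plain Pearson (linear) correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
n_cities, zips_per_city = 50, 20  # assumed sizes, purely illustrative

# Made-up data: zip numbers are shuffled before being assigned to cities,
# so the numeric value of a zip code says nothing about its city id.
zips = list(range(n_cities * zips_per_city))
rng.shuffle(zips)
city_of = {z: i // zips_per_city for i, z in enumerate(zips)}

raw = pearson(list(city_of.keys()), list(city_of.values()))

# Invertible relabelling of the zip codes: order them by owning city and
# number them consecutively. City ids are left unchanged (the identity
# mapping is invertible too).
ordered = sorted(city_of, key=lambda z: (city_of[z], z))
new_zip = {z: rank for rank, z in enumerate(ordered)}

remapped = pearson([new_zip[z] for z in city_of],
                   [city_of[z] for z in city_of])

print(f"raw correlation:      {raw:.3f}")
print(f"remapped correlation: {remapped:.3f}")
```

Nothing about the data changes between the two measurements, only the
labels do. The catch is that the relabelling was built from the full
zip-to-city table, which is exactly the information a planner does not
have up front.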

According to a paper I found, this is related to the "principle of
maximum entropy". The fact that you can't easily determine such
functions in practice doesn't change the fact that zip codes and city
names are highly correlated.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
