Re: Cross-column statistics revisited

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Joshua Tolley <eggyknap(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cross-column statistics revisited
Date: 2008-10-17 06:24:21
Message-ID: 20081017062421.GA1443@svana.org
Lists: pgsql-hackers

On Thu, Oct 16, 2008 at 09:17:03PM -0600, Joshua Tolley wrote:
> Because I'm trying to picture geometrically how this might work for
> the two-column case, and hoping to extend that to more dimensions, and
> am finding that picturing a quantile-based system like the one we have
> now in multiple dimensions is difficult.

Just a note: using multidimensional histograms will work well for the
cases like (startdate,enddate) where the histogram will show a
clustering of values along the diagonal. But it will fail for the case
(zipcode,state) where one implies the other. Histogram-wise you're not
going to see any correlation at all but what you want to know is:

count(distinct zipcode,state) = count(distinct zipcode)

So you might need to think about storing/searching for different kinds
of correlation.
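
As a rough illustration of the distinct-count test above, here is a minimal
Python sketch. The `implies` helper and the sample rows are hypothetical and
made up for this example; this is not how the planner gathers statistics,
only a way to see why the equality detects that one column determines the
other.

```python
def implies(rows, a, b):
    """Return True if, in these rows, each value of column a maps to one value of b.

    This is the check count(distinct a, b) == count(distinct a).
    """
    distinct_a = {row[a] for row in rows}
    distinct_ab = {(row[a], row[b]) for row in rows}
    return len(distinct_ab) == len(distinct_a)


# Hypothetical sample data, invented for illustration only.
rows = [
    {"zipcode": "10001", "state": "NY"},
    {"zipcode": "10001", "state": "NY"},
    {"zipcode": "10002", "state": "NY"},
    {"zipcode": "94105", "state": "CA"},
    {"zipcode": "73301", "state": "TX"},
]

zip_implies_state = implies(rows, "zipcode", "state")  # zipcode determines state
state_implies_zip = implies(rows, "state", "zipcode")  # state does not determine zipcode
```

On a sample rather than the whole table the equality can hold by accident, so
a real implementation would presumably need some tolerance or a confidence
measure.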

Secondly, my feeling about multidimensional histograms is that you're
not going to need the matrix to have 100 bins along each axis, but that
it'll be enough to have 1000 bins total. The cases where we get it
wrong enough for people to notice will probably be the same cases where
the histogram will have noticeable variation even for a small number of
bins.
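
To put numbers on the bin budget, a small Python sketch follows. The
`bins_per_axis` function is a hypothetical helper, and splitting the budget
evenly across axes is an assumption made here for simplicity; nothing above
requires the axes to get equal resolution.

```python
def bins_per_axis(total_bins, dimensions):
    """Largest n such that n ** dimensions <= total_bins (equal split per axis)."""
    n = int(round(total_bins ** (1.0 / dimensions)))
    # Correct for floating-point error in the root, in either direction.
    while n > 1 and n ** dimensions > total_bins:
        n -= 1
    while (n + 1) ** dimensions <= total_bins:
        n += 1
    return max(n, 1)


full_grid_2d = 100 ** 2              # 100 bins on each of two axes
budget_2d = bins_per_axis(1000, 2)   # per-axis bins under a 1000-bin budget
budget_3d = bins_per_axis(1000, 3)   # same budget spread over three columns
```

With two columns, a 1000-bin budget gives about 31 bins per axis (961 cells)
instead of the 10000 cells of a full 100x100 grid; with three columns it
drops to 10 per axis.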

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
