Re: Multi-Dimensional Histograms

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Nathan Boley <npboley(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Fetter <david(at)fetter(dot)org>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Multi-Dimensional Histograms
Date: 2009-06-30 02:22:15
Message-ID: 603c8f070906291922s34f786c4k683860696354a456@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jun 29, 2009 at 8:17 PM, Nathan Boley<npboley(at)gmail(dot)com> wrote:
> On Mon, Jun 29, 2009 at 3:43 PM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> David Fetter <david(at)fetter(dot)org> writes:
>>> On Mon, Jun 29, 2009 at 01:28:01PM -0700, Nathan Boley wrote:
>>>> ... They dismiss
>>>> singular value decomposition and the discrete wavelet transform as
>>>> being too parametric ( which is silly, IMHO )
>>
>>> Should we have a separate discussion about eigenvalues?  Wavelets?
>>
>> I think it'd be a short discussion: what will you do with non-numeric
>> datatypes? We probably don't really want to assume anything stronger
>> than that the datatype has a total ordering.
>
> Well, in the general case, we could use their ranks.
>
> At the end of the day, we cant do any dimension reduction unless the
> ordering encodes some sort of useful information, and the data type
> being in R^n is certainly no guarantee. Consider, for instance, the
> cross correlation of zip-codes and area codes - you would really want
> to order those by some geographic relation. I think that is why
> cross-column stats is so hard in the general case.
>
> That being said, for geographic data in particular, PCA or similar
> could work well.

I'm finding myself unable to follow all the terminology on this thead.
What's dimension reduction? What's PCA?

Based on my last few months of answering questions on -performance,
and my own experience, it seems like a lot of the cases that arise in
practice are those where there is a WHERE clause of the form:

colA = constA and colB op constB

...and it sometimes turns out that the subset of the data where colA =
constA has a very different distribution for colB than the data as a
whole, leading to bad plans. In many cases, it seems like colA is
storing some discrete type of thing, like a customer ID, so the
distribution of colB where colA = constA tells you nothing about the
distribution of colB where colA = constA + someSmallDeltaA. It feels
like what you might need is statistics for colB (MCVs and/or a
histogram) for certain particular values of colA. Unfortunately, in
the general case the set of values of colA for which you need these
statistics might be inconveniently large.

...Robert

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message KaiGai Kohei 2009-06-30 03:23:44 Re: [PATCH] [v8.5] Security checks on largeobjects
Previous Message Bruce YUAN 2009-06-30 02:10:52 Re: How to register my function into backend?