Re: Multi-Dimensional Histograms

From: Nathan Boley <npboley(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Fetter <david(at)fetter(dot)org>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Multi-Dimensional Histograms
Date: 2009-06-30 19:39:48
Message-ID: 6fa3b6e20906301239w1bf3b1cer88545747d041c3c@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> I'm finding myself unable to follow all the terminology on this thead.
>  What's dimension reduction?  What's PCA?

( Sorry for the jargon - thanks Josh )

> It feels like what you might need is statistics for colB (MCVs and/or a
> histogram) for certain particular values of colA.

Certainly - this is exactly what a multi-dimensional histogram would store.

> Unfortunately, in
> the general case the set of values of colA for which you need these
> statistics might be inconveniently large.
>

Which is why, in the general case, we need some sort of dimension reduction.

If we could say that, for instance, the distribution of colB is
roughly the same for all values of colA less than 100, and it is
roughly the same for all values of colA >= 100, then we would have
effectively reduced the dimension from ndistinct(colB)*ndistinct(colA)
to 2*ndistinct(colB).

The multidimensional histogram in the above paper is somewhat akin to
what you suggested and I just described - it attempts to select
contiguous regions that are "the same". PCA is a much different
approach, but it is still, broadly, a dimension reduction technique.

> At the end of the day, we cant do any dimension reduction unless the
> ordering encodes some sort of useful information, and the data type
> being in R^n is certainly no guarantee. Consider, for instance, the
> cross correlation of zip-codes and area codes - you would really want
> to order those by some geographic relation. I think that is why
> cross-column stats is so hard in the general case.

To clarify my response to Tom, directly above, if as in Robert's
example, all of the values in colA < 100 were different than all of
the values > 100 with respect to colB, then it is easy to represent
the structure in something manageable. Ie, we store two histograms for
colB: 1 for colA > 100, one for colA < 100. However, if there are the
same two classes in colB ( lets say C1 and C2 ) but rather than C1
being associated with values in colA < 100, it is associated with 100
completely random values in colA ( ie 1, 22, 47, 89, 91, 92, 107, ...
) there is not a whole lot we can do besides storing a list of all of
those values. We could maybe use the ratio of classes to improve the
average plan choice, but you would still get a lot of bad plans.

Could anyone provide a concrete example ( or better yet a data set )
where this occurs?

-Nathan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Chris Browne 2009-06-30 19:41:40 Re: pre-proposal: permissions made easier
Previous Message Richard Huxton 2009-06-30 19:31:42 Re: 8.5 development schedule