## Re: Multi-Dimensional Histograms

From: Nathan Boley Robert Haas Tom Lane , David Fetter , PG Hackers Re: Multi-Dimensional Histograms 2009-06-30 19:39:48 6fa3b6e20906301239w1bf3b1cer88545747d041c3c@mail.gmail.com (view raw, whole thread or download thread mbox) 2009-06-29 16:52:00 from David Fetter  2009-06-29 20:28:01 from Nathan Boley   2009-06-29 20:44:30 from David Fetter    2009-06-29 22:43:35 from Tom Lane     2009-06-29 22:56:44 from David Fetter     2009-06-30 00:17:00 from Nathan Boley      2009-06-30 02:22:15 from Robert Haas       2009-06-30 04:34:22 from Joshua Tolley       2009-06-30 19:39:48 from Nathan Boley       2009-07-05 08:15:22 from Gregory Maxwell    2009-06-29 23:49:50 from Nathan Boley pgsql-hackers
```> I'm finding myself unable to follow all the terminology on this thead.
>  What's dimension reduction?  What's PCA?

( Sorry for the jargon - thanks Josh )

> It feels like what you might need is statistics for colB (MCVs and/or a
> histogram) for certain particular values of colA.

Certainly - this is exactly what a multi-dimensional histogram would store.

> Unfortunately, in
> the general case the set of values of colA for which you need these
> statistics might be inconveniently large.
>

Which is why, in the general case, we need some sort of dimension reduction.

If we could say that, for instance, the distribution of colB is
roughly the same for all values of colA less than 100, and it is
roughly the same for all values of colA >= 100, then we would have
effectively reduced the dimension from ndistinct(colB)*ndistinct(colA)
to 2*ndistinct(colB).

The multidimensional histogram in the above paper is somewhat akin to
what you suggested and I just described - it attempts to select
contiguous regions that are "the same". PCA is a much different
approach, but it is still, broadly, a dimension reduction technique.

> At the end of the day, we cant do any dimension reduction unless the
> ordering encodes some sort of useful information, and the data type
> being in R^n is certainly no guarantee. Consider, for instance, the
> cross correlation of zip-codes and area codes - you would really want
> to order those by some geographic relation. I think that is why
> cross-column stats is so hard in the general case.

To clarify my response to Tom, directly above, if as in Robert's
example, all of the values in colA < 100 were different than all of
the values > 100 with respect to colB, then it is easy to represent
the structure in something manageable. Ie, we store two histograms for
colB: 1 for colA > 100, one for colA < 100. However, if there are the
same two classes in colB ( lets say C1 and C2 ) but rather than C1
being associated with values in colA < 100, it is associated with 100
completely random values in colA ( ie 1, 22, 47, 89, 91, 92, 107, ...
) there is not a whole lot we can do besides storing a list of all of
those values. We could maybe use the ratio of classes to improve the
average plan choice, but you would still get a lot of bad plans.

Could anyone provide a concrete example ( or better yet a data set )
where this occurs?

-Nathan

```

### pgsql-hackers by date

 Next: From: Chris Browne Date: 2009-06-30 19:41:40 Subject: Re: pre-proposal: permissions made easier Previous: From: Richard Huxton Date: 2009-06-30 19:31:42 Subject: Re: 8.5 development schedule