## Re: Multi-Dimensional Histograms

From: David Fetter Nathan Boley PG Hackers Re: Multi-Dimensional Histograms 2009-06-29 20:44:30 20090629204430.GE21081@fetter.org (view raw, whole thread or download thread mbox) 2009-06-29 16:52:00 from David Fetter  2009-06-29 20:28:01 from Nathan Boley   2009-06-29 20:44:30 from David Fetter    2009-06-29 22:43:35 from Tom Lane     2009-06-29 22:56:44 from David Fetter     2009-06-30 00:17:00 from Nathan Boley      2009-06-30 02:22:15 from Robert Haas       2009-06-30 04:34:22 from Joshua Tolley       2009-06-30 19:39:48 from Nathan Boley       2009-07-05 08:15:22 from Gregory Maxwell    2009-06-29 23:49:50 from Nathan Boley pgsql-hackers
```On Mon, Jun 29, 2009 at 01:28:01PM -0700, Nathan Boley wrote:
> > For things like PostGIS, which will want to index in 4 dimensions
> > (x, y, z, t), we might want to have multi-dimensional selectivity
> > histograms and some way to use same.
>
> Another use case is cross column statistics.

Good to see there's more than one use case.  I was hoping people would
chime in with more use cases, and here one is, fast! :)

> > Anybody here qualified to check out this paper on the subject, please
> > speak up :)
>
> Well, I don't know if I'm qualified per se, but here goes...
>
> It seems seems like a very reasonable approach to multidimensional
> histograms.
>
> For those who haven't had time to read the paper, it describes
> SGRID, a greedy algorithm for partitioning an n-dimensional space
> into non-overlapping n-dimensional boxes and 'outliers' ( points
> that dont fit into any boxes, but dont warrant their own boxes ).
> Then, they compare their method to a regression technique,  where
> the space is fit into a fixed grid histogram and then the grid
> densities are estimated via regression. They also compare a plane
> splitting algorithm, called MHIST, where the planes are constrained
> to be orthogonal to the naive dimension vectors. They dismiss
> singular value decomposition and the discrete wavelet transform as
> being too parametric ( which is silly, IMHO )

Should we have a separate discussion about eigenvalues?  Wavelets?

> and briefly mention a couple standard clustering techniques ( which
> are probably not appropriate for us, given their complexity ).

Good to know.

> Unsurprisingly, it does well on the two test sets that they
> consider.

Wait.  You mean they might have carefully chosen data to make their
point?!?

;)

> It think the general idea is fine, but it would certainly need some
> modification to work for postgres.
>
> In particular,
>
> Using the naive dimension vectors ( ie, north and east for geo data,
> or column a and column b for cross-column stats ) in combination with
> the box constraint will probably lead to problems. Consider, for
> example, a river that goes in a straight line north-east, and a table
> that stores camp sites which are mostly along the river. Because SGRID
> can only create boxes with north east edges, you would end up with a
> bunch of tiny box on the river where one, long north-east pointing box
> would do.
>
> We would probably want to replace the error term that SGRID uses to
> determine when to create a new box by a statistical test ( maybe
> homogeneity of means? ) ( also, the same holds for the outliers ).
>
> Finally, this creates the partition but ( AFAICT ) it doesn't describe
> a method for locating the histogram estimate given a point ( although
> that doesn't seem too difficult ).

Is that "not difficult," in terms of the math that needs doing, or
"not difficult," in terms of how well PostgreSQL is already set up to
implement, or...?

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!