## Re: Multi-Dimensional Histograms

From: Nathan Boley David Fetter PG Hackers Re: Multi-Dimensional Histograms 2009-06-29 20:28:01 6fa3b6e20906291328jc6f33b9me6f911845ea70bfa@mail.gmail.com (view raw, whole thread or download thread mbox) 2009-06-29 16:52:00 from David Fetter  2009-06-29 20:28:01 from Nathan Boley   2009-06-29 20:44:30 from David Fetter    2009-06-29 22:43:35 from Tom Lane     2009-06-29 22:56:44 from David Fetter     2009-06-30 00:17:00 from Nathan Boley      2009-06-30 02:22:15 from Robert Haas       2009-06-30 04:34:22 from Joshua Tolley       2009-06-30 19:39:48 from Nathan Boley       2009-07-05 08:15:22 from Gregory Maxwell    2009-06-29 23:49:50 from Nathan Boley pgsql-hackers
> For things like PostGIS, which will want to index in 4 dimensions
> (x, y, z, t), we might want to have multi-dimensional selectivity
> histograms and some way to use same.
>

Another use case is cross column statistics.

> Anybody here qualified to check out this paper on the subject, please
> speak up :)
>

Well, I don't know if I'm qualified per se, but here goes...

It seems seems like a very reasonable approach to multidimensional histograms.

For those who haven't had time to read the paper, it describes SGRID,
a greedy algorithm for partitioning an n-dimensional space into
non-overlapping n-dimensional boxes and 'outliers' ( points that dont
fit into any boxes, but dont warrant their own boxes ). Then, they
compare their method to a regression technique,  where the space is
fit into a fixed grid histogram and then the grid densities are
estimated via regression. They also compare a plane splitting
algorithm, called MHIST, where the planes are constrained to be
orthogonal to the naive dimension vectors. They dismiss singular value
decomposition and the discrete wavelet transform as being too
parametric ( which is silly, IMHO ) and briefly mention a couple
standard clustering techniques ( which are probably not appropriate
for us, given their complexity ). Unsurprisingly, it does well on the
two test sets that they consider.

It think the general idea is fine, but it would certainly need some
modification to work for postgres.

In particular,

Using the naive dimension vectors ( ie, north and east for geo data,
or column a and column b for cross-column stats ) in combination with
the box constraint will probably lead to problems. Consider, for
example, a river that goes in a straight line north-east, and a table
that stores camp sites which are mostly along the river. Because SGRID
can only create boxes with north east edges, you would end up with a
bunch of tiny box on the river where one, long north-east pointing box
would do.

We would probably want to replace the error term that SGRID uses to
determine when to create a new box by a statistical test ( maybe
homogeneity of means? ) ( also, the same holds for the outliers ).

Finally, this creates the partition but ( AFAICT ) it doesn't describe
a method for locating the histogram estimate given a point ( although
that doesn't seem too difficult ).

-Nathan

### pgsql-hackers by date

 Next: From: David Fetter Date: 2009-06-29 20:44:30 Subject: Re: Multi-Dimensional Histograms Previous: From: Josh Berkus Date: 2009-06-29 20:02:46 Subject: Re: pre-proposal: permissions made easier