Re: Multi-Dimensional Histograms

From: David Fetter <david(at)fetter(dot)org>
To: Nathan Boley <npboley(at)gmail(dot)com>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Multi-Dimensional Histograms
Date: 2009-06-29 20:44:30
Message-ID: 20090629204430.GE21081@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jun 29, 2009 at 01:28:01PM -0700, Nathan Boley wrote:
> > For things like PostGIS, which will want to index in 4 dimensions
> > (x, y, z, t), we might want to have multi-dimensional selectivity
> > histograms and some way to use same.
>
> Another use case is cross column statistics.

Good to see there's more than one use case. I was hoping people would
chime in with more use cases, and here one is, fast! :)

> > Anybody here qualified to check out this paper on the subject, please
> > speak up :)
>
> Well, I don't know if I'm qualified per se, but here goes...
>
> It seems seems like a very reasonable approach to multidimensional
> histograms.
>
> For those who haven't had time to read the paper, it describes
> SGRID, a greedy algorithm for partitioning an n-dimensional space
> into non-overlapping n-dimensional boxes and 'outliers' ( points
> that dont fit into any boxes, but dont warrant their own boxes ).
> Then, they compare their method to a regression technique, where
> the space is fit into a fixed grid histogram and then the grid
> densities are estimated via regression. They also compare a plane
> splitting algorithm, called MHIST, where the planes are constrained
> to be orthogonal to the naive dimension vectors. They dismiss
> singular value decomposition and the discrete wavelet transform as
> being too parametric ( which is silly, IMHO )

Should we have a separate discussion about eigenvalues? Wavelets?

> and briefly mention a couple standard clustering techniques ( which
> are probably not appropriate for us, given their complexity ).

Good to know.

> Unsurprisingly, it does well on the two test sets that they
> consider.

Wait. You mean they might have carefully chosen data to make their
point?!?

;)

> It think the general idea is fine, but it would certainly need some
> modification to work for postgres.
>
> In particular,
>
> Using the naive dimension vectors ( ie, north and east for geo data,
> or column a and column b for cross-column stats ) in combination with
> the box constraint will probably lead to problems. Consider, for
> example, a river that goes in a straight line north-east, and a table
> that stores camp sites which are mostly along the river. Because SGRID
> can only create boxes with north east edges, you would end up with a
> bunch of tiny box on the river where one, long north-east pointing box
> would do.
>
> We would probably want to replace the error term that SGRID uses to
> determine when to create a new box by a statistical test ( maybe
> homogeneity of means? ) ( also, the same holds for the outliers ).
>
> Finally, this creates the partition but ( AFAICT ) it doesn't describe
> a method for locating the histogram estimate given a point ( although
> that doesn't seem too difficult ).

Is that "not difficult," in terms of the math that needs doing, or
"not difficult," in terms of how well PostgreSQL is already set up to
implement, or...?

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Filip Rembiałkowski 2009-06-29 20:48:45 Re: facing problem with createdb.
Previous Message Nathan Boley 2009-06-29 20:28:01 Re: Multi-Dimensional Histograms