Re: index structure for 114-dimension vector

From: "Alexander Staubo" <alex(at)purefiction(dot)net>
To: "Andrew Lazarus" <andrew(at)pillette(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: index structure for 114-dimension vector
Date: 2007-04-26 22:34:36
Message-ID: 88daf38c0704261534n2a6d3ab8g57482bae8672da53@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On 4/20/07, Andrew Lazarus <andrew(at)pillette(dot)com> wrote:
> I have a table with 2.5 million real[] arrays. (They are points in a
> time series.) Given a new array X, I'd like to find, say, the 25
> closest to X in some sense--for simplification, let's just say in the
> usual vector norm. Speed is critical here, and everything I have tried
> has been too slow.

Let me chime in with the observation that this is a multidimensional
nearest neighbour (reverse nearest neighbour and its close cousin,
k-NN) that is well known in statistics, and particularly relevant to
statistical learning and classification. Knowing the jargon might help
you dig up efficient algorithms to mine your data; there are tons of
fascinating papers available through Citeseer.

In particular, I recommend the paper "Efficient k-NN Search on
Vertically Decomposed Data" by de Vries et al, SIGMOD 2002 (PDF here:
http://citeseer.ist.psu.edu/618138.html), if only for inspiration. It
proposes an algorithm called BOND to drastically reduce the search
space by probalistic means. They give an example using image
histograms, but the algorithm applies to any multidimensional data.
Briefly put, it points out that proximity comparison can be computed
vertically, a few dimensions at a time, and entire subsets can be
thrown away when it's apparent that they are below a statistically
derived lower bound. The only gotcha is that the algorithm derives
much of its performance from the assumption that your data is
vertically decomposed, one table per dimension, otherwise the search
effectively incurs a sequential scan of the entire dataset, and then
you're pretty much back to square one.

The most common approach to nearest neighbour search is to use a
spatial data structure. The classic algorithm is the kd-tree
(http://en.wikipedia.org/wiki/Kd-tree) and there's the newer K-D-B
tree, neither of which are available in PostgreSQL. If I remember
correctly, R-trees have also been shown to be useful for high numbers
of dimensions; with PostgreSQL you have R-trees and even better
R-tree-equivalent support through GiST. I have no idea whether you can
actually munge your integer vectors into something GiST can index and
search, but it's a thought. (GiST, presumably, can also theoretically
index kd-trees and other spatial trees.)

Alexander.

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Scott Marlowe 2007-04-26 22:42:26 Re: Simple query, 10 million records...MySQL ten times faster
Previous Message Jeff Hoffmann 2007-04-26 21:40:17 Re: Simple query, 10 million records...MySQL ten times faster