Re: index structure for 114-dimension vector

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Alexander Staubo <alex(at)purefiction(dot)net>
Cc: Andrew Lazarus <andrew(at)pillette(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: index structure for 114-dimension vector
Date: 2007-04-27 04:42:37
Message-ID: Pine.LNX.4.64.0704270838230.12152@sn.sai.msu.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Fri, 27 Apr 2007, Alexander Staubo wrote:

> On 4/20/07, Andrew Lazarus <andrew(at)pillette(dot)com> wrote:
>> I have a table with 2.5 million real[] arrays. (They are points in a
>> time series.) Given a new array X, I'd like to find, say, the 25
>> closest to X in some sense--for simplification, let's just say in the
>> usual vector norm. Speed is critical here, and everything I have tried
>> has been too slow.
>
> Let me chime in with the observation that this is a multidimensional
> nearest neighbour (reverse nearest neighbour and its close cousin,
> k-NN) that is well known in statistics, and particularly relevant to
> statistical learning and classification. Knowing the jargon might help
> you dig up efficient algorithms to mine your data; there are tons of
> fascinating papers available through Citeseer.
>
> In particular, I recommend the paper "Efficient k-NN Search on
> Vertically Decomposed Data" by de Vries et al, SIGMOD 2002 (PDF here:
> http://citeseer.ist.psu.edu/618138.html), if only for inspiration. It
> proposes an algorithm called BOND to drastically reduce the search
> space by probalistic means. They give an example using image
> histograms, but the algorithm applies to any multidimensional data.
> Briefly put, it points out that proximity comparison can be computed
> vertically, a few dimensions at a time, and entire subsets can be
> thrown away when it's apparent that they are below a statistically
> derived lower bound. The only gotcha is that the algorithm derives
> much of its performance from the assumption that your data is
> vertically decomposed, one table per dimension, otherwise the search
> effectively incurs a sequential scan of the entire dataset, and then
> you're pretty much back to square one.
>
> The most common approach to nearest neighbour search is to use a
> spatial data structure. The classic algorithm is the kd-tree
> (http://en.wikipedia.org/wiki/Kd-tree) and there's the newer K-D-B
> tree, neither of which are available in PostgreSQL. If I remember
> correctly, R-trees have also been shown to be useful for high numbers
> of dimensions; with PostgreSQL you have R-trees and even better
> R-tree-equivalent support through GiST. I have no idea whether you can
> actually munge your integer vectors into something GiST can index and
> search, but it's a thought. (GiST, presumably, can also theoretically
> index kd-trees and other spatial trees.)

you're right, but currently only theoretically due to interface restriction.
We have plan to improve it sometime. There was SP-GiST project, which
could be used for k-d-b tree, see http://www.cs.purdue.edu/spgist/
I don't know if it works with 8.2 version. Also, it doesn't supports
concurrency and recovery

>
> Alexander.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Andres Retzlaff 2007-04-27 04:43:06 Usage up to 50% CPU
Previous Message Tom Lane 2007-04-27 02:49:13 Re: Feature Request --- was: PostgreSQL Performance Tuning