Skip site navigation (1) Skip section navigation (2)

Re: index structure for 114-dimension vector

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Alexander Staubo <alex(at)purefiction(dot)net>
Cc: Andrew Lazarus <andrew(at)pillette(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: index structure for 114-dimension vector
Date: 2007-04-27 04:42:37
Message-ID: (view raw, whole thread or download thread mbox)
Lists: pgsql-performance
On Fri, 27 Apr 2007, Alexander Staubo wrote:

> On 4/20/07, Andrew Lazarus <andrew(at)pillette(dot)com> wrote:
>> I have a table with 2.5 million real[] arrays. (They are points in a
>> time series.) Given a new array X, I'd like to find, say, the 25
>> closest to X in some sense--for simplification, let's just say in the
>> usual vector norm. Speed is critical here, and everything I have tried
>> has been too slow.
> Let me chime in with the observation that this is a multidimensional
> nearest neighbour (reverse nearest neighbour and its close cousin,
> k-NN) that is well known in statistics, and particularly relevant to
> statistical learning and classification. Knowing the jargon might help
> you dig up efficient algorithms to mine your data; there are tons of
> fascinating papers available through Citeseer.
> In particular, I recommend the paper "Efficient k-NN Search on
> Vertically Decomposed Data" by de Vries et al, SIGMOD 2002 (PDF here:
>, if only for inspiration. It
> proposes an algorithm called BOND to drastically reduce the search
> space by probalistic means. They give an example using image
> histograms, but the algorithm applies to any multidimensional data.
> Briefly put, it points out that proximity comparison can be computed
> vertically, a few dimensions at a time, and entire subsets can be
> thrown away when it's apparent that they are below a statistically
> derived lower bound. The only gotcha is that the algorithm derives
> much of its performance from the assumption that your data is
> vertically decomposed, one table per dimension, otherwise the search
> effectively incurs a sequential scan of the entire dataset, and then
> you're pretty much back to square one.
> The most common approach to nearest neighbour search is to use a
> spatial data structure. The classic algorithm is the kd-tree
> ( and there's the newer K-D-B
> tree, neither of which are available in PostgreSQL. If I remember
> correctly, R-trees have also been shown to be useful for high numbers
> of dimensions; with PostgreSQL you have R-trees and even better
> R-tree-equivalent support through GiST. I have no idea whether you can
> actually munge your integer vectors into something GiST can index and
> search, but it's a thought. (GiST, presumably, can also theoretically
> index kd-trees and other spatial trees.)

you're right, but currently  only theoretically due to interface restriction.
We have plan to improve it sometime. There was SP-GiST project, which 
could be used for k-d-b tree,  see
I don't know if it works with 8.2 version. Also, it doesn't supports
concurrency and recovery

> Alexander.
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?

Oleg Bartunov, Research Scientist, Head of AstroNet (,
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su,
phone: +007(495)939-16-83, +007(495)939-23-83

In response to


pgsql-performance by date

Next:From: Andres RetzlaffDate: 2007-04-27 04:43:06
Subject: Usage up to 50% CPU
Previous:From: Tom LaneDate: 2007-04-27 02:49:13
Subject: Re: Feature Request --- was: PostgreSQL Performance Tuning

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group