Re: index structure for 114-dimension vector

From: C Storm <christian(dot)storm(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: index structure for 114-dimension vector
Date: 2007-04-23 17:49:31
Message-ID: 1177350571.162300.296770@b75g2000hsg.googlegroups.com
Lists: pgsql-performance

On Apr 20, 12:07 pm, and(dot)(dot)(dot)(at)pillette(dot)com (Andrew Lazarus) wrote:
> I have a table with 2.5 million real[] arrays. (They are points in a
> time series.) Given a new array X, I'd like to find, say, the 25
> closest to X in some sense--for simplification, let's just say in the
> usual vector norm. Speed is critical here, and everything I have tried
> has been too slow.
>
> I imported the cube contrib package, and I tried creating an index on
> a cube of the last 6 elements, which are the most important. Then I
> tested the 2.5MM rows for being contained within a tolerance of the
> last 6 elements of X, +/- 0.1 in each coordinate, figuring that would
> be an indexed search (which I CLUSTERED on). I then ran the sort on
> this smaller set. The index was used, but it was still too slow. I
> also tried creating new columns with rounded int2 values of the last 6
> coordinates and made a multicolumn index.
>
> For each X the search is taking about 4-15 seconds which is above my
> target at least one order of magnitude. Absolute numbers are dependent
> on my hardware and settings, and some of this can be addressed with
> configuration tweaks, etc., but first I think I need to know the
> optimum data structure/indexing strategy.
>
> Is anyone on the list experienced with this sort of issue?
>
> Thanks.
> Andrew Lazarus and(dot)(dot)(dot)(at)pillette(dot)com

Having worked in high-dimensional spaces a lot in my career, I think
you'll find there are mathematical limits on how fast this can go. In
practical terms a seq_scan will be unavoidable, since to a first
approximation you are limited to an exhaustive search of the full
114-dimensional space unless you make approximations or apply some
kind of dimensionality reduction.
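
For concreteness, the exhaustive version looks roughly like the sketch
below. This is only an illustration: the table and column names
(series, id, vec) are placeholders of my own, not from your post, and
the distance function is just an ordinary PL/pgSQL loop.

-- Illustrative schema: series(id int, vec real[]).
CREATE OR REPLACE FUNCTION euclidean_dist(a real[], b real[])
RETURNS float8 AS $$
DECLARE
    s float8 := 0;
BEGIN
    FOR i IN array_lower(a, 1) .. array_upper(a, 1) LOOP
        s := s + (a[i] - b[i]) ^ 2;
    END LOOP;
    RETURN sqrt(s);
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;

-- Exhaustive k-NN: computes the distance for every row, then sorts.
-- Over 2.5 million rows this is necessarily a seq_scan plus a sort.
PREPARE knn(real[]) AS
    SELECT id
    FROM series
    ORDER BY euclidean_dist(vec, $1)
    LIMIT 25;

-- The query vector must have the same length as the stored arrays.
EXECUTE knn('{1.5, 0.2, 3.7}');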

Read up on the Curse of Dimensionality: http://en.wikipedia.org/wiki/Curse_of_dimensionality

Have you considered dimensionality reduction techniques such as
Singular Value Decomposition (SVD) or Principal Components Analysis
(PCA)?
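
In PostgreSQL terms, one way to use that is to compute the first few
principal-component scores for every row offline (with whatever
SVD/PCA tool you prefer), store them in ordinary columns, index those
with cube, and use the index only as a coarse filter before an exact
re-rank. The sketch below is an assumption of mine, not something you
described: the pc1..pc3 columns, the index name, and the 0.5 search
radius are placeholders, and it reuses the euclidean_dist() function
from the previous sketch.

-- Assumed: pc1..pc3 hold each row's first three principal-component
-- scores, computed offline. Names and the radius are illustrative.
CREATE INDEX series_pc_idx ON series
    USING gist (cube(ARRAY[pc1, pc2, pc3]::float8[]));

-- Coarse box filter in the reduced space (can use the GiST index),
-- then exact re-rank of the surviving candidates.
PREPARE knn_reduced(float8, float8, float8, real[]) AS
    SELECT id
    FROM series
    WHERE cube(ARRAY[pc1, pc2, pc3]::float8[])
          <@ cube(ARRAY[$1 - 0.5, $2 - 0.5, $3 - 0.5],
                  ARRAY[$1 + 0.5, $2 + 0.5, $3 + 0.5])
    ORDER BY euclidean_dist(vec, $4)
    LIMIT 25;

The box just has to be wide enough that the true 25 nearest neighbors
are rarely cut off; that trade-off is the approximation I mean above.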
