Re: index structure for 114-dimension vector

From: Arjen van der Meijden <acmmailing(at)tweakers(dot)net>
To: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Andrew Lazarus <andrew(at)pillette(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: index structure for 114-dimension vector
Date: 2007-04-27 06:07:38
Message-ID: 4631932A.6000000@tweakers.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On 21-4-2007 1:42 Mark Kirkwood wrote:
> I don't think that will work for the vector norm i.e:
>
> |x - y| = sqrt(sum over j ((x[j] - y[j])^2))

I don't know if this is usefull here, but I was able to rewrite that
algorithm for a set of very sparse vectors (i.e. they had very little
overlapping factors) to something like:
|x - y| = sum over j (x[j]^2) + sum over j (y[j]^2)
+ for each j where x[j] and y[j] are both non-zero: - (x[j]^2 +
y[j]^2) + (x[j] - y[j])^2

The first two parts sums can be calculated only once. So if you have
very little overlap, this is therefore much more efficient (if there is
no overlap at all you end up with x[j]^2 + y[j]^2 anyway). Besides, this
rewritten calculation allows you to store the X and Y vectors using a
trivial table-layout vector(x,i,value) which is only filled with
non-zero's and which you can trivially self-join to find the closest
matches. You don't care about the j's where there is either no x or
y-value anyway with this rewrite.

I can compare over 1000 y's of on average 100 elements to two x's of
over 1000 elements on just a single 1.8Ghz amd processor. (I use it for
a bi-kmeans algorithm, so there are only two buckets to compare to).

So it might be possible to rewrite your algorithm to be less
calculation-intensive. Obviously, with a dense-matrix this isn't going
to work, but there may be other ways to circumvent parts of the
algorithm or to cache large parts of it.
It might also help to extract only the 6 relevant columns into a
seperate temporary table which will have much smaller records and thus
can fit more records per page.

Best regards,

Arjen

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Florian Weimer 2007-04-27 07:39:28 Re: Filesystem fragmentation (Re: Fragmentation of WAL files)
Previous Message Greg Smith 2007-04-27 04:50:55 Re: Fragmentation of WAL files