Re: scoring differences between bitmasks

From: "Todd A(dot) Cook" <tcook(at)blackducksoftware(dot)com>
To: Ben <bench(at)silentmedia(dot)com>
Cc: Postgresql-General mailing list <pgsql-general(at)postgresql(dot)org>
Subject: Re: scoring differences between bitmasks
Date: 2005-10-03 00:16:06
Message-ID: 43407846.3080601@blackducksoftware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi,

Sorry for being so abtuse; my 9 year old has been helping me with my
work today. :)

The hamming distance between two bit vectors can also be found as the
sum of distances between sub-vectors. Let's assume you're looking
for all vectors less than d bits away from a search vector, and we'll
divide the vectors into k-bit subvectors. The value of k will depend
on d, how sparse you data is, and how much memory you have for lookup
tables. The goal is to eliminate the vectors that are more than d
bits away. (This assumes that most of the search vectors are more
than d bits away; for the opposite case, just reverse the sense of
the comparisons below.)

Compute the set of k-bit vectors that <d away from the first k-bits
of the search vector, using a lookup table if feasible. Find the
set of full vectors whose initial k-vector is in this set. For
those whose distance is d-1, the remaining bits have to be identical,
which is easy to check. Repeat for the second set of k-vectors
where their distance has to be less than d-1, then d-2, etc. You
can do these searches using the linked tries I mentioned earlier.
(I've described the searches as happening sequentially, but that
needn't be the case. You can search the subvectors simultaneously
and then filter the results at the client.)

You don't have to use constant length subvectors. You can start
with small subvectors to keep the size of the search sets down,
and then increase the vector length as the set of remaining
possibilities shrinks. Also, you don't have to work from one
end of the full vector to the other; start with a subvector where
you expect a lot of variability.

I think this method is similar in spirit, if not detail, to what
Martijn suggested.

I've actually implemented this approach before, though using my
own file-based lookups rather than a real database. I had a set
of 128-bit vectors used to filter an input stream; most of the
comparisons never looked past 64 bits.

-- todd

Ben wrote:
> Sure, but unless I can figure out some way to choose a small number of
> vectors, I'm left with computing the full N^2 set. Given that I'm
> shooting for N to be 4 million or larger, that's a lot of data to
> store.....
>
> On Oct 2, 2005, at 12:14 PM, Todd A. Cook wrote:
>
>> Ben wrote:
>>
>>> Just the number of bits, not which ones. Basically, the hamming
>>> distance.
>>>
>>
>> I see. Could you pre-compute the bit counts for the vectors in the
>> table?
>> You could count the bits in the search vector as Martijn suggested,
>> and then
>> do a lookup based on the count.

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Wes 2005-10-03 00:22:09 Re: 8.1 'make check' fails
Previous Message Zlatko Matić 2005-10-02 23:14:57 Unbound text box, Text > 255 characters, MSAccess/PostgreSQL