Re: Bloom index

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bloom index
Date: 2010-01-18 13:13:18
Message-ID: 407d949e1001180513g4216844dl48f3701ab32310d1@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

2010/1/18 Teodor Sigaev <teodor(at)sigaev(dot)ru>:
>> So for example if your bloom filter is 4 bits per column your error
>> rate for a single column where clause will be 1/2^(4/1.44) or a little
>> worse than returning 1 record in 7. If you test two or three columns
>> then it'll be about 1 in 49 or 1 in 343.
>
> Hmm, I don't understand your calculations. In your example (4 bits per
> column, index has only one column) each row is hashed by 4 bits in 80-bit
> vector (signature), so, probability of collision is (1/80)^4

Sorry, I meant if you had n columns in the index, and the signature
was 4*n bits, and the where clause has one key in the search
condition. According to some guy editing the wikipedia page the number
of bits per value in the set a bloom filter needs to achieve an error
rate of e is 1.44*log_2(1/e). So the error rate for 4 bits per column
would be 1/2^(1.44*nbits). That's larger than the error rate for a
simple hash which would be 1/2^nbits.

The reason bloom filters are advantageous when checking for a value
*anywhere* in the set is that with a simple hash of each value you
would have that error rate for each check, so you would have a much
higher false positive rate for finding it somewhere. But in your case
you know which position you want to check.

--
greg

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Stark 2010-01-18 13:23:21 Bloom filters bloom filters bloom filters
Previous Message Teodor Sigaev 2010-01-18 12:47:39 Re: Bloom index