Quick Links

Re: Bloom index

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Bloom index
Date:	2010-01-18 13:13:18
Message-ID:	407d949e1001180513g4216844dl48f3701ab32310d1@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

2010/1/18 Teodor Sigaev <teodor(at)sigaev(dot)ru>:
>> So for example if your bloom filter is 4 bits per column your error
>> rate for a single column where clause will be 1/2^(4/1.44) or a little
>> worse than returning 1 record in 7. If you test two or three columns
>> then it'll be about 1 in 49 or 1 in 343.
>
> Hmm, I don't understand your calculations. In your example (4 bits per
> column, index has only one column) each row is hashed by 4 bits in 80-bit
> vector (signature), so, probability of collision is (1/80)^4

Sorry, I meant if you had n columns in the index, and the signature
was 4*n bits, and the where clause has one key in the search
condition. According to some guy editing the wikipedia page the number
of bits per value in the set a bloom filter needs to achieve an error
rate of e is 1.44*log_2(1/e). So the error rate for 4 bits per column
would be 1/2^(1.44*nbits). That's larger than the error rate for a
simple hash which would be 1/2^nbits.

The reason bloom filters are advantageous when checking for a value
*anywhere* in the set is that with a simple hash of each value you
would have that error rate for each check, so you would have a much
higher false positive rate for finding it somewhere. But in your case
you know which position you want to check.

--
greg

In response to

Re: Bloom index at 2010-01-18 12:47:39 from Teodor Sigaev

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Greg Stark	2010-01-18 13:23:21	Bloom filters bloom filters bloom filters
Previous Message	Teodor Sigaev	2010-01-18 12:47:39	Re: Bloom index