Re: Bloom index

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bloom index
Date: 2010-01-18 08:37:03
Message-ID: 407d949e1001180037k575aeb16v572ac64a569c14f0@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 18, 2010 at 2:29 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> Bloom filters need to be sized to have n bits per set element to
> maintain a targeted false positive rate. And that false positive rate
> would be higher than just having an n bit hash for each set element.
> Normally they have the advantage that you can test for membership
> anywhere in the set quickly -- but in this case you actually only want
> to test a specific position in the set anyways.
>
> So I think you can get a lower false positive rate by just truncating
> each column's hash value to n bits and storing an array of them.

So for example if your bloom filter is 4 bits per column, the error
rate for a single-column where clause will be 1/2^(4/1.44), or a
little worse than returning 1 record in 7. If you test two or three
columns then it'll be about 1 in 49 or 1 in 343.

If you just stored a nibble for each column then you could look up the
specific nibble for that column in a single-column where clause and
get an error rate of 1 in 16. If you test two or three columns then it
would be 1 in 256 or 1 in 4,096.

If you're aiming at very large numbers of columns with very large
where clauses then perhaps you only need, say, 2 bits per column. But
for any number of bits per column I think you're always better off
just storing that many bits of each hash rather than using a bloom
filter for this.
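
To illustrate that claim at other budgets (again just arithmetic, and
again assuming the optimal number of hash functions for the Bloom
filter), the single-column false positive rates look like this:

import math

# Optimally tuned Bloom filter vs. a plain truncated hash of the same
# size, for a few per-column bit budgets.
for b in (2, 4, 8, 16):
    bloom_fp = 2 ** -(b * math.log(2))   # ~2^-(b/1.44)
    hash_fp = 2 ** -b
    print("%2d bits/column: bloom 1 in %10.0f, truncated hash 1 in %10.0f"
          % (b, 1 / bloom_fp, 1 / hash_fp))
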

Also, as it stands now it's an unordered list. I assume you have no
plans to implement vacuum for this? Perhaps it would be better to
store a btree of signatures ordered by htid. That would let you easily
vacuum dead entries. Though it would necessitate random seeks to do a
full scan of the index, unless you can solve the problem of running a
full index scan concurrently with page splits.

--
greg
