Bloom Filter improvements in PostgreSQL

From: Ross Heaney <rheaney26(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Bloom Filter improvements in PostgreSQL
Date: 2025-07-03 15:00:29
Message-ID: CAJMh_DJQLmUONKyDhZPhpWboqRV_+i0JbwtADB1R6PZX9uacwA@mail.gmail.com
Lists: pgsql-hackers

Hi All,

I would like to propose some improvements to the bloom filter
implementation that currently exists in Postgres. I describe the
changes below, along with a patch and some tests. I wanted to get
feedback before spending more time on this, so the patch is a work in
progress rather than production ready, but it should be sufficient for
review. The patch compiles, passes tests, and doesn't break anything
(to the best of my knowledge!).

*Problem Statement*
The existing bloom filter uses an integer number of hash functions.
This is completely fine, but we can do better. A recent paper
<https://arxiv.org/abs/2502.02193> describes how bloom filters can be
tuned to use a non-integer number of hash functions. This is
particularly nice because the optimal number of hash functions is
often not an integer. When the number of hash functions (k) is
calculated, it could be something like 2.3. We would then round up and
use k = 3 hash functions for the bloom filter, which makes the filter
ceil(k)/k times the optimal size, i.e. up to 1/k larger (for k = 2.3
that is a factor of 3/2.3, about 30% larger). If we take floor(k)
instead, we get an increase in the filter's density, which increases
the false positive rate; that may not be desirable either.

*Proposed Improvements*
Instead we can relax the assumption that k must be an integer. To do
this we split the value into an integer part and a fractional part;
call these k and r. In the example above, where we calculated the
optimal value to be 2.3, this means k = 2 and r = 0.3. We always apply
the k hash functions, and we apply a third hash function 30% of the
time (corresponding to r = 0.3 in this case). The choice is made
deterministically from the element itself, so the same element always
gets the same treatment, which keeps the bloom filter promise of no
false negatives. The paper has more information, but this is the gist
of the rational part of the bloom filter.
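
To illustrate, here is a minimal standalone sketch of the decision
rule. This is not code from the patch: mix64() is just a stand-in
hash and nhashes_for() is a name I made up.

#include <stdio.h>
#include <stdint.h>

/* Stand-in 64-bit mixer (the SplitMix64 finalizer). */
static uint64_t
mix64(uint64_t x)
{
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

/*
 * Number of hash functions to apply to an element when the tuned
 * count is k + r (integer k, fractional r).  The choice is a pure
 * function of the element's hash, so insert and lookup always agree
 * and the no-false-negatives guarantee is preserved.
 */
static int
nhashes_for(uint64_t elem_hash, int k, double r)
{
    /* Map a salted copy of the hash to [0, 1) and compare against r. */
    double u = (double) (mix64(elem_hash ^ 0x9e3779b97f4a7c15ULL) >> 11) /
               (double) (1ULL << 53);

    return u < r ? k + 1 : k;
}

int
main(void)
{
    int n = 1000000, extra = 0;

    for (int i = 0; i < n; i++)
        if (nhashes_for(mix64((uint64_t) i), 2, 0.3) == 3)
            extra++;
    /* Expect roughly 30% of elements to get the third hash. */
    printf("extra hash applied to %.2f%% of elements\n",
           100.0 * extra / n);
    return 0;
}

Both the insert and lookup paths would derive the per-element hash
count this way, so the same element is always hashed the same number
of times.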

However, we can take this further and make the bloom filter lossless.
The rational part described above has given us some wiggle room, so to
speak. We can complement the rational, 'lossy' bloom filter with an
additional array called a witness. This allows us to reconstruct the
original input, which simply means a false positive rate of 0. The
witness (W) is constructed concurrently with the bloom filter bitmap
and encodes information for every element that passes the initial
bloom filter test (both the true and the false positives). Clearly
this establishes a direct relationship between the false positive rate
and the size of the witness. Because this is all done
deterministically, the decoding process is just the reverse of the
encoding process: we can easily tell true positives from false
positives by checking the witness, and by the same determinism there
can be no false negatives.
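
To make the encode/decode idea concrete, here is a toy, self-contained
example over a small enumerable universe. Again, this is my sketch of
the scheme rather than code from the patch: the witness here stores
one entry per filter-passer, and the O(UNIVERSE) rank scan at lookup
time is for clarity only (a real implementation would use something
like a rank/select structure, or decode sequentially).

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define UNIVERSE    1024    /* toy universe: 0..1023, in index order */
#define FILTER_BITS 256     /* deliberately small so false positives occur */
#define K           3       /* integer hash count; rational part omitted */

/* Stand-in 64-bit mixer (the SplitMix64 finalizer). */
static uint64_t
mix64(uint64_t x)
{
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

static uint64_t
bit_index(uint64_t key, int i)
{
    return mix64(key + (uint64_t) i * 0x9e3779b97f4a7c15ULL) % FILTER_BITS;
}

static void
bloom_add(uint64_t *bits, uint64_t key)
{
    for (int i = 0; i < K; i++)
    {
        uint64_t idx = bit_index(key, i);

        bits[idx / 64] |= 1ULL << (idx % 64);
    }
}

static bool
bloom_test(const uint64_t *bits, uint64_t key)
{
    for (int i = 0; i < K; i++)
    {
        uint64_t idx = bit_index(key, i);

        if (!((bits[idx / 64] >> (idx % 64)) & 1))
            return false;
    }
    return true;
}

int
main(void)
{
    uint64_t filter[FILTER_BITS / 64] = {0};
    bool member[UNIVERSE] = {false};
    bool witness[UNIVERSE];     /* one entry per passer, in universe order */
    int nwitness = 0, errors = 0;

    /* The real set: multiples of 7. */
    for (int x = 0; x < UNIVERSE; x++)
        if (x % 7 == 0)
        {
            member[x] = true;
            bloom_add(filter, (uint64_t) x);
        }

    /*
     * Encode: walk the universe in its fixed order and record, for
     * every element the lossy filter passes, whether it is a true
     * member (1) or a false positive (0).
     */
    for (int x = 0; x < UNIVERSE; x++)
        if (bloom_test(filter, (uint64_t) x))
            witness[nwitness++] = member[x];

    /*
     * Decode/lookup: a filter miss is a definite "no"; on a hit, the
     * witness entry at this element's rank among passers gives the
     * exact answer.
     */
    for (int x = 0; x < UNIVERSE; x++)
    {
        bool ans = false;

        if (bloom_test(filter, (uint64_t) x))
        {
            int rank = 0;

            for (int y = 0; y < x; y++)
                if (bloom_test(filter, (uint64_t) y))
                    rank++;
            ans = witness[rank];
        }
        if (ans != member[x])
            errors++;
    }
    printf("witness entries: %d, wrong answers: %d\n", nwitness, errors);
    return 0;
}

By construction the decode pass replays exactly the same deterministic
tests as the encode pass, so it reports zero wrong answers: the filter
plus the witness give exact membership.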

*Tradeoffs and Considerations*
As stated in the previous section, there is a direct relationship
between the false positive rate and the size of the witness. The
lossless capability comes at the price of more memory for the witness.
The expected witness size can be calculated as t * (1 + r^2) bits,
where t is the number of true positives and r = 1/ln 2 ≈ 1.44. Let's
give a concrete example:

t = 1,000,000 (a million items in the filter)
fp = 0.01
size of the filter = 9,585,059 bits (approx 1.14 MB)
k = 6.65 (assuming we use a rational number of hash functions)

For a bloom filter set up with these parameters we can expect a
witness size of approximately 3,000,000 bits (arithmetic below). The
total space complexity is still O(n). The main tradeoff is of course
the memory overhead of the witness. I have done some preliminary
benchmarking and, as expected, the use of the witness slows down
inserts and lookups: inserts are around 3x slower and lookups around
2x slower. I definitely think this can be made faster, but in any case
a 'lossless' option will always be slower than the traditional bloom
filter.
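
For reference, the example numbers follow (up to rounding) from the
standard bloom filter formulas, with n items and target false positive
rate fp:

  m       = -n * ln(fp) / (ln 2)^2
          = -1,000,000 * ln(0.01) / (ln 2)^2 ≈ 9,585,059 bits
  k       = (m/n) * ln 2 = -log2(fp) = -log2(0.01) ≈ 6.64
  witness ≈ t * (1 + (1/ln 2)^2) = 1,000,000 * (1 + 1.4427^2)
          ≈ 3,081,000 bits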

*Final thoughts and comments*
The rational bloom filter would be a nice improvement to the existing
bloom filter implementation. The lossless option could be useful in
some scenarios, at the expense of extra memory and slower lookup and
insertion times. I have attached two WIP patches showing the changes
(mainly to do with the lossless option). I would appreciate any
feedback on whether any or all of this proposal would be considered.
I'd be happy to make the changes myself.

Attachment Content-Type Size
0001-create-an-improved-filter-with-a-zero-fp-rate-for-ve.patch application/octet-stream 32.4 KB
0002-update-to-use-rational-hashing.-Still-working-out-so.patch application/octet-stream 2.3 KB
