Re: Bloom Filter improvements in postgesql

From: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
To: Ross Heaney <rheaney26(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Bloom Filter improvements in postgesql
Date: 2025-07-03 15:59:29
Message-ID: 5851D665-033E-4394-A7D3-1E9C9847B68E@justatheory.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

This is an interesting proposal, thanks for posting.

On Jul 3, 2025, at 11:00, Ross Heaney <rheaney26(at)gmail(dot)com> wrote:

> Proposed Improvements
> Instead we can relax the assumption that it must be an integer value for k. To do this we split the k value into an integer and a rational part. Call this k and r. So in the example above where we calculated k to be equal to 2.3 this would mean k = 2 and r = 0.3. We always use k hash functions but we apply a third hash function 30% of the time(corresponding to r = 0.3 in this case). This is done deterministically so the same element always gets the same treatment(keeps the bloom filter promise of no false negatives). The paper has more information but this is the gist of the rational part of the bloom filter.

Intuitively this makes sense to me. I skimmed the paper and my eyes rolled up into my head, but the basic idea certainly seems sound. I’m curious about the "Variably-Sized Block Bloom filter,” bit, though, which I don’t follow and doesn’t seem to be part of your proposal.

> However we can take this further and look to make the bloom filter lossless. Using the rational part described above we've given ourselves some wriggle room so to speak. We can compliment the rational 'lossy' bloom filter with an additional array called a witness. This allows us to reconstruct the original input, which just means a false positive rate of 0.

I have a bit of a harder time with the witness. I get that the size is limited, so it won’t grow to ginormous size based on the number of entries, but don’t understand how it could be so small (adding around 25% of additional memory in your example) unless it, too, accepts a certain false positivity rate. In which case I wonder whether there is a comparable reduction in the false positive rate by simply increasing the size of the bloom filter itself by 25%.

But since you suggest that the false positive rate can be reduced to effectively zero, I must be missing something. I’d be curious to read more on this pattern.

Best,

David

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2025-07-03 16:01:23 Re: libpq OpenSSL and multithreading
Previous Message Zhou, Zhiguo 2025-07-03 15:38:32 Re: Optimize LWLock scalability via ReadBiasedLWLock for heavily-shared locks