Re: Hash Joins vs. Bloom Filters / take 2

From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash Joins vs. Bloom Filters / take 2
Date: 2018-02-20 23:48:11
Message-ID: CAGTBQpaAan3G9Fznc-2eGostJ7VYRQOELRUdNKuZ6-MLrKPJ=g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 20, 2018 at 8:23 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Tue, Feb 20, 2018 at 3:17 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>> I've worked a lot with bloom filters, and for large false positive
>> rates and large sets (multi-million entries), you get bloom filter
>> sizes of about 10 bits per distinct item.
>
> It's generally true that you need 9.6 bits per element to get a 1%
> false positive rate. 1% could be considered much too low here.
>
> Do we need to eliminate 99% of all hash join probes (that find nothing
> to join on) to make this Bloom filter optimization worthwhile?
> Personally, I doubt it.

Even for 90% it's about 4.6 bits per element.

What I'm saying is that you can't assume you can fit the whole thing
in work_mem. If it could be said that any set worth a hash join will
fit, you can build a work_mem-sized bloom filter and just accept that
if you exceed its capacity its filtering efficiency will drop
benignly. But IMO that can't be taken for granted, so at some point
you should just give up, the overhead of querying a low-quality BF
won't be worth it.

The HLL is good for that. You can keep adding to the BF until the HLL
tells you you're way past the capacity of a work_mem-sized BF, then
you free that memory and go on without it.

You might also want to consider scalable BFs. The smaller you can keep
the BF, the better chance you'll have of fitting it into the L3 cache
(you can fit quite a lot of entries into a decen-sized L3 cache).

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2018-02-20 23:54:40 Re: Hash Joins vs. Bloom Filters / take 2
Previous Message Tom Lane 2018-02-20 23:47:59 Re: ALTER TABLE ADD COLUMN fast default