Re: Hash Joins vs. Bloom Filters / take 2

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash Joins vs. Bloom Filters / take 2
Date: 2018-02-20 23:23:58
Message-ID: CAH2-Wz=FZ8r=qDX3hz5ctaDNts1+bdeVNYhXB+m+dr79XYTajQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 20, 2018 at 3:17 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> I've worked a lot with bloom filters, and for large false positive
> rates and large sets (multi-million entries), you get bloom filter
> sizes of about 10 bits per distinct item.

It's generally true that you need 9.6 bits per element to get a 1%
false positive rate. 1% could be considered much too low here.

Do we need to eliminate 99% of all hash join probes (that find nothing
to join on) to make this Bloom filter optimization worthwhile?
Personally, I doubt it.

> That's efficient, but it's not magic. It can still happen that the
> whole set can't fit in work_mem with an acceptable false positive
> rate.

A merge join is always going to be the better choice when extremely
memory constrained.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message R, Siva 2018-02-20 23:30:35 Duplicate Item Pointers in Gin index
Previous Message Claudio Freire 2018-02-20 23:17:28 Re: Hash Joins vs. Bloom Filters / take 2