Re: WIP: bloom filter in Hash Joins with batches

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: bloom filter in Hash Joins with batches
Date: 2016-01-10 04:11:01
Message-ID: CAM3SWZTW-=EK0TfuoJuq5X1Xg=05k_vHeCrPMfvYXTDHqK0yLg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jan 9, 2016 at 11:02 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> Which means the "dim.r" column has 100 different values (0-99) with uniform
> distribution. So e.g. "WHERE r < 15" matches 15%.

I think that the use of a uniform distribution to demonstrate this
patch is a bad idea, unless you want to have a conversation about the
worst case.

Look at the use cases for bloom filters in general. They're almost all
some variation on the same theme: checking a bloom filter
inexpensively, to avoid an expensive cache miss (of some fashion). The
Google Chrome web browser uses a bloom filter for malicious URLs. It
usually avoids consulting Google's servers about whether or not any
given URL that is visited is malicious by using the bloom filter. This
is effective in large part because the top 100 websites ranked by
popularity have a huge falloff in popularity as you go down the list.
It looks like a Zipfian distribution, and so I imagine they get pretty
far with a very small bloom filter, even though in theory the chances
of any given valid URL resulting in a cache miss is very high.

Generally, uniform distributions are rare in a non-canonical list of
things, like a hash join outer relation's attribute.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2016-01-10 05:29:39 ExecGather() + nworkers
Previous Message Peter Geoghegan 2016-01-10 03:03:48 Re: WIP: bloom filter in Hash Joins with batches