Re: Hash Joins vs. Bloom Filters / take 2

From: Patrick Krecker <pkrecker(at)gmail(dot)com>
To: David Steele <david(at)pgmasters(dot)net>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash Joins vs. Bloom Filters / take 2
Date: 2018-03-07 00:27:16
Message-ID: CACh_hd7wg4iEcnD_RJ9ZiQRHNPjonf4=5SR3uzpVgJ0yVPc_xQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 1, 2018 at 4:04 PM, David Steele <david(at)pgmasters(dot)net> wrote:
> On 3/1/18 6:52 PM, Tomas Vondra wrote:
>>
>> On 03/02/2018 12:31 AM, Andres Freund wrote:
>>>
>>>
>>>
>>> On March 1, 2018 3:22:44 PM PST, Tomas Vondra
>>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>>
>>>>
>>>>
>>>> On 03/01/2018 11:01 PM, Andres Freund wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 2018-02-20 22:23:54 +0100, Tomas Vondra wrote:
>>>>>>
>>>>>> So I've decided to revive the old patch, rebase it to current
>>>>
>>>> master,
>>>>>>
>>>>>> and see if we can resolve the issues that killed it in 2016.
>>>>>
>>>>>
>>>>> There seems to be some good discussion in the thread. But the patch
>>>>> arrived just before the last commitfest and certainly isn't a trivial
>>>>> cleanup patch. Therefore I think it should be moved to the next CF?
>>>>>
>>>>
>>>> It isn't a massive invasive patch either, though, so I object to moving
>>>> it to 2018-09 right away.
>>>
>>>
>>> Why do we have rules around not submitting large stuff to the last
>>> cf, if we just not follow through? We're neck deep in patches that
>>> are older. And you've already gotten a fair bit of feedback..
>>>
>>
>> It was not my intention to break (or even bend) the CF rules, of course.
>> I haven't considered the patch to be "large stuff", while you do. I see
>> Peter Geoghegan agrees with your conclusion on another thread, so go
>> ahead and move it to 2018-09.
>
>
> After reviewing the thread I also agree that this should be pushed to
> 2018-09, so I have done so.
>
> I'm very excited by this patch, though. In general I agree with Peter that
> a higher rate of false positives is acceptable to save memory. I also don't
> see any reason why this can't be tuned with a parameter. Start with a
> conservative default and allow the user to adjust as desired.
>
> --
> -David
> david(at)pgmasters(dot)net
>

Hi All --

I'm curious what has to be done to move this patch along. I looked
through the patched and I noticed that the work for deciding whether to
instantiate the bloom filter in single-batch mode is not complete yet
(or if it's in this change, I can't find it), contrary to this
comment:

+ * When starting in a single-batch mode, we do nothing initially.
+ * If the whole hash table fits into a single batch, we can get
+ * sufficiently accurate ndistinct estimate by simply counting
+ * occupied buckets (thanks to shooting for NTUP_PER_BUCKET=1),
+ * or perhaps we could use something more elaborate (e.g. HLL).
+ * But we only build the bloom filter if the hash table is large
+ * enough to exceed on-CPU caches (e.g. 4MB).

I did some basic benchmarking with two tables and a simple where
clause filtering 90% of rows and it does yield about a 1.2x speedup in
my tests. In the pathological case where two tables are joined on a
pk/fk relationship with no filtering, the penalty seems to be about
1-2%.

Patrick

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2018-03-07 01:31:07 Re: [HACKERS] [Patch] Log SSL certificate verification errors
Previous Message Kyotaro HORIGUCHI 2018-03-07 00:26:38 Re: Index-only scan returns incorrect results when using a composite GIST index with a gist_trgm_ops column.