Re: Hash Joins vs. Bloom Filters / take 2

From:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hash Joins vs. Bloom Filters / take 2
Date:	2018-02-22 02:37:44
Message-ID:	f7faf7d3-9677-5b11-0b7f-25135465a638@2ndquadrant.com
Lists:	pgsql-hackers

On 02/21/2018 08:17 AM, Thomas Munro wrote:
> On Wed, Feb 21, 2018 at 10:23 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> In 2015/2016 I explored whether we could improve hash joins by
>> leveraging bloom filters [1], and I was reminded about this idea in a
>> thread about amcheck [2]. I also see that bloom filters were briefly
>> mentioned in the thread about parallel hash [3].
>>
>> So I've decided to revive the old patch, rebase it to current
>> master, and see if we can resolve the issues that killed it in
>> 2016.
>
> Nice!
>
>> Opinions?
>
> I'm definitely following this and interested in helping in some way
> if I can. I have wondered about this subject and discussed it a bit
> with Peter Geoghegan off-list.
>

Good ;-)

I think one important thing we need to figure out is the costing, or
some other way to decide when to build the Bloom filters (and perhaps
whether to prefer a larger, more accurate filter or a smaller one).
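
As a rough illustration of that size/accuracy trade-off, here is a small
sketch using the textbook Bloom filter sizing formulas. It is not taken
from the patch; the function name and parameters are made up for this
example, and a real costing model would also have to weigh the build and
probe overhead against the join's expected selectivity.

```python
import math


def bloom_parameters(n_items, false_positive_rate):
    """Return (bits, hash_count) for a standard Bloom filter.

    Textbook formulas (hypothetical helper, not PostgreSQL code):
      bits   = -n * ln(p) / (ln 2)^2
      hashes = (bits / n) * ln 2
    """
    if n_items <= 0 or not (0.0 < false_positive_rate < 1.0):
        raise ValueError("need n_items > 0 and 0 < false_positive_rate < 1")
    bits = math.ceil(
        -n_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    hashes = max(1, round(bits / n_items * math.log(2)))
    return bits, hashes


# Compare an accurate filter with a cheaper, less accurate one
# for one million inner-side keys.
for p in (0.01, 0.1):
    bits, hashes = bloom_parameters(1_000_000, p)
    print(f"p={p}: {bits / 8 / 1024:.0f} kB, {hashes} hash functions")
```

For one million keys, a 1% false positive rate needs about 9.6 bits per
key and 7 hash functions, while a 10% rate needs about 4.8 bits per key
and 3 hash functions, so the smaller filter costs roughly half the memory.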

But if you want to look into adding support for parallel hash, or
pushing the Bloom filter down to the scans, feel free to do so.

> Some assorted thoughts:
>
> In the old thread, Peter pointed at a curious undergrad student
> project from 2008[1] evaluating Bloom filters for hash joins in
> PostgreSQL 8.3, inspired by a couple of older papers[2][3]. While
> your patch uses a Bloom filter to short-circuit the regular bucket
> probe in ExecHashJoinImpl(), these approaches push the Bloom filter down
> into the outer relation scan. I suspect you're right about the fixed
> sizing being a problem, but the general idea seems pretty interesting
> to me and there seems to be no reason you couldn't make the filter
> size dynamic as you have it and then share it via a parameter or
> something. But is there any point?
>
> On the one hand, pushing down Bloom filters requires the hash value
> to be computed by the lower scan, and then computed again if the
> tuple survives the filter and makes it into the Hash Join node
> (unless there is some way to attach it to the tuple...). On the
> other hand, throwing away tuples sooner can avoid more work,
> particularly in the case of multi-joins.
>

I do agree it's an interesting idea, and being able to push the filter
down would be great, particularly in the case of very selective joins
(i.e. when many outer rows have no match in the hash table). I have no
idea how much infrastructure it would require, though, or how widely it
could be used.
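
For illustration only, a toy sketch of the basic idea: the filter is
built alongside the hash table and consulted before the bucket probe, so
outer rows with no possible match are discarded early. None of this is
the patch's code or the executor's API; BloomFilter, hash_join and their
parameters are hypothetical names, and a Python dict stands in for the
hash table. Pushing the filter down would mean making the same
might_contain() check inside the outer scan instead of in the join.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter using double hashing (illustrative only)."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, key):
        # Derive two 64-bit hashes from one digest, then combine them.
        digest = hashlib.blake2b(repr(key).encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        for i in range(self.n_hashes):
            yield (h1 + i * h2) % self.n_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(key)
        )


def hash_join(inner_rows, outer_rows, n_bits=1024, n_hashes=3):
    """Join (key, payload) rows; return (results, outer rows skipped)."""
    table = {}
    bloom = BloomFilter(n_bits, n_hashes)
    for key, payload in inner_rows:        # build phase
        table.setdefault(key, []).append(payload)
        bloom.add(key)

    results, skipped = [], 0
    for key, payload in outer_rows:        # probe phase
        if not bloom.might_contain(key):   # short-circuit the bucket probe
            skipped += 1
            continue
        for inner_payload in table.get(key, ()):
            results.append((key, inner_payload, payload))
    return results, skipped


inner = [(1, "a"), (2, "b")]
outer = [(i, i) for i in range(100)]
results, skipped = hash_join(inner, outer)
print(results, skipped)
```

With a selective join like this one, almost all outer rows are rejected
by the filter without touching the hash table, which is the case where
the approach should pay off.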

Judging by your thoughts on the impact of left-deep vs. right-deep joins
etc., you've already given this far more thought than I have ;-)

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
