Re: Hash Joins vs. Bloom Filters / take 2

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash Joins vs. Bloom Filters / take 2
Date: 2018-02-20 23:54:40
Message-ID: 88b8aff2-5be3-e932-af80-b3cd48b5084c@2ndquadrant.com
Lists: pgsql-hackers

On 02/21/2018 12:06 AM, Peter Geoghegan wrote:
> On Tue, Feb 20, 2018 at 1:23 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> In 2015/2016 I've been exploring if we could improve hash joins by
>> leveraging bloom filters [1], and I was reminded about this idea in a
>> thread about amcheck [2]. I also see that bloom filters were briefly
>> mentioned in the thread about parallel hash [3].
>>
>> So I've decided to revive the old patch, rebase it to current master,
>> and see if we can resolve the issues that killed it in 2016.
>
> Cool.
>
>> The issues are essentially about predicting when the bloom filter can
>> improve anything, which is more a question of the hash join selectivity
>> than the bloom filter parameters.
>
> Absolutely. A Bloom filter could be considered an opportunistic thing.
> The false positive rate that you estimate is going to be based on the
> planner's rowcount estimate for the inner side of the join, which is
> liable to be wrong anyway. It's important to be resilient to
> misestimations, which Bloom filters are good at.
>

It's a bit more difficult than that: rowcount is the total number of
rows, which may not be the same as the number of distinct values in the
join key. But it does give us an upper bound on that number.

But yeah, that's how the bloom filter is sized in the patch.

>> This is not the same as "false positive rate" which is a feature
>> of the bloom filter itself, and can be measured by simply counting
>> bits set in the filter. That is important too, of course, but
>> simple to determine.
>
> Perhaps I'm being pedantic, but it's not exactly true that you can
> measure the false positive rate by counting the bits. I do agree with
> what I think you meant here, though, which is that the precise false
> positive rate is not necessarily that important, while the
> selectivity of the join is very important.
>

My reasoning is that given a bloom filter with K out of M bits set, the
probability that the filter answers "true" for a random value is

(K/M)^H

where H is the number of hash functions. I believe that's pretty much
the definition of the false positive rate, though I admit I haven't
checked the exact math.

The important takeaway here is that many bits set -> bad.
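That estimate is simple enough to sketch in a few lines (a hypothetical
helper for illustration, not code from the patch):

```python
def estimated_fpr(bits_set, filter_bits, num_hashes):
    """Estimate the false positive rate of a Bloom filter from its
    observed fill factor: a random non-member value tests "true" only
    if all num_hashes probed bit positions happen to be set."""
    fill = bits_set / filter_bits
    return fill ** num_hashes

# A half-full filter with 3 hash functions answers "maybe" for
# roughly 12.5% of random non-member values; a completely full
# filter answers "maybe" for everything.
print(estimated_fpr(512, 1024, 3))   # 0.125
print(estimated_fpr(1024, 1024, 3))  # 1.0
```

which also makes the "many bits set -> bad" rule of thumb concrete: the
estimate degrades as the fill factor approaches 1.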

> In general, hash joins work best when the join qual is highly
> selective. This is not the same thing as having a small hash table,
> of course.
>
>> After thinking about this a bit more, I think this is pretty much
>> what we do during join cardinality estimation - we estimate how
>> many of the rows in "fact" have a matching row in "dim". I do think
>> we can reuse that to decide if to use a bloom filter or not.
>>
>> Of course, the decision may need to be more elaborate (and perhaps
>> formulated as part of costing), but the filter selectivity is the
>> crucial piece. The attached patch does not do that yet, though.
>
> I suspect that it could make sense to use a Bloom filter to
> summarize the entire inner side of the join all at once, even when
> there are multiple batches. I also suspect that this is particularly
> beneficial with parallel hash joins, where IPC/synchronization
> overhead can be a big problem.
>

But that's what the patch does, currently - the filter is built during
the initial pass through the data, and then used for all batches. This
comes from the original idea to throw away rows from the outer relation
(usually the larger one) that have no match in the hash table. Without
the filter, we stash such rows into a batch file, possibly shuffling
them between batches if we happen to need to add more, only to throw
them away much later when we finally process that batch.

Actually, now that I think about it - I think the patch should throw
the filter away after the initial pass over the outer relation, because
at that point we've used all the information in the filter.

I'm not sure it would make sense to then build smaller bloom filters for
individual batches, but maybe it would?
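For illustration, the overall flow - build the filter during the initial
pass over the inner side, then use it to reject outer rows before they
would be stashed into batch files - might look roughly like this
(a simplified single-key, in-memory sketch, nothing like the patch's
actual C code):

```python
import hashlib

def bit_positions(value, num_hashes, filter_bits):
    # Derive num_hashes bit positions from one strong hash (simplified).
    digest = hashlib.sha256(repr(value).encode()).digest()
    return [int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % filter_bits
            for i in range(num_hashes)]

def hash_join_with_bloom(inner, outer, filter_bits=1 << 16, num_hashes=3):
    bloom = bytearray(filter_bits // 8)
    table = {}
    # Initial pass over the inner side: build the hash table and the
    # bloom filter in the same pass.
    for key, payload in inner:
        table.setdefault(key, []).append(payload)
        for pos in bit_positions(key, num_hashes, filter_bits):
            bloom[pos // 8] |= 1 << (pos % 8)
    # Pass over the outer side: the filter rejects most non-matching
    # rows up front, instead of carrying them through the batches.
    for key, payload in outer:
        if all(bloom[pos // 8] >> (pos % 8) & 1
               for pos in bit_positions(key, num_hashes, filter_bits)):
            for match in table.get(key, ()):
                yield key, payload, match
```

In the real multi-batch case the point is that the filter summarizes
*all* inner batches at once, so an outer row can be discarded even when
its batch has not been processed yet.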

>> 4) Initially, the patch was aimed at hashjoins with batches, on the
>> premise that it can save some of the serialize/deserialize costs. But as
>> mentioned in the discussion, bloom filters can be beneficial even in the
>> single-batch case, when three conditions are met:
>>
>> a) the join is selective (i.e. some rows have no match in the hash table)
>>
>> b) hash table > CPU cache
>>
>> c) bloom filter < CPU cache
>
> Seems plausible. CPU cache efficiency is also affected by how many
> hash functions you end up using -- use too many, and it slows things
> down.
>

Yeah, I admit those are rather crude rules.

>> We don't have a good way to determine the size of the CPU cache, but
>> I think we can use some crude arbitrary limits. E.g. require that the
>> hash table is larger than 8MB and the bloom filter is smaller than
>> 4MB, or something like that.
>
> FWIW, the logical way to size the Bloom filter is based on the number
> of (distinct) values, not based on the projected size of the hash
> table. The lossy nature of Bloom filters makes the average size of the
> original, hashed elements irrelevant to a sizing calculation that
> targets a certain final false positive rate. Making Bloom filter size
> any fixed fraction of hash table size seems wrong to me for that
> reason alone.
>

Sure, and the patch does exactly that when we actually have a good
estimate (although it's using the expected number of rows of the inner
relation, not ndistinct).

The trouble is that when we start with a single batch and then find out
the estimate was wrong and we need to start batching, all bets are off.
At that point it seems reasonable to just say "Here is X MBs of RAM, do
what you can".
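That two-mode policy - size from the estimate while we trust it, fall
back to a fixed slice of work_mem once batching proves it wrong - could
be sketched as follows (names, the 5% target rate, and the 1/8 fraction
are illustrative, not the patch's actual constants):

```python
import math

def bloom_filter_size_bits(ndistinct_estimate, work_mem_bytes,
                           target_fpr=0.05, fallback_fraction=8):
    """Size the bloom filter from the inner-side row estimate when we
    have one; once the estimate is known to be wrong (we had to start
    batching), fall back to a fixed fraction of work_mem."""
    if ndistinct_estimate is not None:
        # Standard sizing formula: m = -n * ln(p) / (ln 2)^2
        optimal = -ndistinct_estimate * math.log(target_fpr) / (math.log(2) ** 2)
        return int(optimal)
    # "Here is X bytes of RAM, do what you can": 1/fallback_fraction
    # of work_mem, expressed in bits.
    return work_mem_bytes * 8 // fallback_fraction
```

The estimate-driven branch targets a false positive rate; the fallback
branch only bounds memory use and accepts whatever rate results.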

> You should try to exploit the fact that a Bloom filter can summarize a
> large set reasonably well with a very compact, simple representation.
> A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
> for cases where Bloom probes will save a lot of work, it probably
> doesn't make all that much difference -- the hash join is still much
> faster. If you're willing to accept a 10% false positive rate, then
> you can summarize a set of 40 million distinct values with only
> 22.85MB of memory and 3 hash functions. I think that the smallest
> possible hash table of 40 million elements would require a much
> greater amount of memory than 22.85MB -- certainly closer to 20x
> than to 8x. Even that is pretty conservative. I bet there are
> plenty of hash joins where the ratio could be 30x or more. Not to
> mention cases with multiple batches.
>

But the problem is that I don't know the total size of the hash table,
because we're building the bloom filter for all the batches at once.
And we don't know how many batches there will be - if we knew
that, we could estimate the number of distinct values and we could use
that to size the filter instead of doing this. (All of this only applies
to the state where we start with a single batch and then find out we
need to start batching, of course.)

In other words, I'm not claiming that 1/8 of the hash table is a
reasonable size for the bloom filter, but that 1/8 of work_mem seems
like a reasonable compromise for sizing it. Perhaps there should be
some upper boundary for very large work_mem values, though.
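For what it's worth, Peter's 22.85MB / 3 hash functions figure above
does match the textbook sizing formulas (a quick sanity check, not code
from the patch):

```python
import math

n = 40_000_000  # distinct values to summarize
p = 0.10        # acceptable false positive rate

# Optimal filter size and hash count for an n-element set at rate p:
#   m = -n * ln(p) / (ln 2)^2,   k = (m / n) * ln 2
m_bits = -n * math.log(p) / (math.log(2) ** 2)
k = (m_bits / n) * math.log(2)

print(m_bits / 8 / 1024 / 1024)  # ≈ 22.85 (MB)
print(round(k))                  # 3
```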

Regarding false positive rate - I agree it may be more efficient to use
a smaller bloom filter with worse false positive rate, particularly if
the smaller one fits into a CPU cache and the larger one does not.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
