Re: Hash Joins vs. Bloom Filters / take 2

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash Joins vs. Bloom Filters / take 2
Date: 2018-02-22 02:21:20
Message-ID: ec86dff3-41b6-6f5d-121c-95f23c763859@2ndquadrant.com
Lists: pgsql-hackers

On 02/21/2018 02:10 AM, Peter Geoghegan wrote:
> On Tue, Feb 20, 2018 at 3:54 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>> I suspect that it could make sense to use a Bloom filter to
>>> summarize the entire inner side of the join all at once, even
>>> when there are multiple batches. I also suspect that this is
>>> particularly beneficial with parallel hash joins, where
>>> IPC/synchronization overhead can be a big problem.
>>>
>>
>> But that's what the patch does, currently - the filter is built
>> during the initial pass through the data, and then used for all
>> batches.
>
> I misunderstood. I would probably do something like double or triple
> the original rows estimate instead, though. The estimate must be at
> least slightly inaccurate when we get to this point, but I don't
> think that that's a good enough reason to give up on the estimate
> completely.
>

That's a problem only for the multi-batch case, though.

With a single batch we can walk the hash table and count non-empty
buckets, to get a good ndistinct estimate cheaply, and then size the
filter considering both memory requirements (so it fits into the CPU
cache) and the false positive rate. There are other things we may need
to consider (memory usage vs. work_mem), but that's a separate issue.
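Just to illustrate what I mean, here is a minimal standalone sketch of
that sizing step, using the usual Bloom filter formulas and an arbitrary
4MB cache budget - none of the names or constants come from the patch:

#include <math.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hedged sketch, not the patch's code: size a Bloom filter from an
 * ndistinct estimate (obtained e.g. by counting non-empty hash buckets)
 * and a target false positive rate, then cap it to a cache budget.
 */
typedef struct BloomSizing
{
	uint64_t	nbits;		/* filter size in bits */
	int			nhashes;	/* number of hash functions */
} BloomSizing;

static BloomSizing
bloom_sizing(double ndistinct, double target_fpr, uint64_t cache_limit_bytes)
{
	BloomSizing s;
	double		ln2 = log(2.0);

	/* standard formula: m = -n * ln(p) / (ln 2)^2 */
	double		nbits = -ndistinct * log(target_fpr) / (ln2 * ln2);

	/* cap the filter so it (hopefully) stays CPU-cache resident */
	if (nbits > cache_limit_bytes * 8.0)
		nbits = cache_limit_bytes * 8.0;

	s.nbits = (uint64_t) ceil(nbits);

	/* optimal number of hash functions: k = (m / n) * ln 2 */
	s.nhashes = (int) round((double) s.nbits / ndistinct * ln2);
	if (s.nhashes < 1)
		s.nhashes = 1;

	return s;
}

int
main(void)
{
	/* e.g. 1M distinct keys, 1% target FPR, 4MB cache budget */
	BloomSizing s = bloom_sizing(1e6, 0.01, 4 * 1024 * 1024);

	printf("bits=%llu (%.1f kB), hashes=%d\n",
		   (unsigned long long) s.nbits, s.nbits / 8.0 / 1024.0, s.nhashes);
	return 0;
}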

With multiple batches I think we could use the "size the bloom filter
for a fraction of work_mem" approach that the current patch uses when
switching to multiple batches halfway through. That pretty much entirely
ignores the estimate and essentially replaces it with a "fictional" one.

I think that's a better approach than using some arbitrary multiple of
the estimate. When we have to start batching halfway through, the
estimate has proven to be rather bogus anyway, but we can still treat it
as a lower bound for the bloom filter size.
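A compressed sketch of what that fallback might look like (the 10%
fraction and both names are mine, not taken from the patch):

#include <stdint.h>
#include <stdio.h>

/*
 * Hedged sketch of the "fraction of work_mem" fallback: once we switch
 * to batching mid-build the ndistinct estimate is suspect, so size the
 * filter as a fixed share of work_mem, but never below what the
 * original estimate asked for (the lower bound discussed above).
 */
#define BLOOM_WORK_MEM_FRACTION 0.10	/* e.g. 10% of work_mem */

static uint64_t
bloom_fallback_nbits(uint64_t work_mem_bytes, uint64_t estimated_nbits)
{
	uint64_t	nbits = (uint64_t) (work_mem_bytes * 8 * BLOOM_WORK_MEM_FRACTION);

	if (nbits < estimated_nbits)
		nbits = estimated_nbits;	/* estimate acts as a lower bound */

	return nbits;
}

int
main(void)
{
	/* 4MB work_mem, estimate originally asked for ~1.2M bits */
	printf("filter size: %llu bits\n",
		   (unsigned long long) bloom_fallback_nbits(4ULL * 1024 * 1024, 1200000));
	return 0;
}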

>> Actually, now that I think about it - I think the patch should
>> throw the filter away after the initial pass over the outer
>> relation, because at that point we've used all the information in
>> the filter.
>
> Makes sense.
>

Actually, the patch already does that - it stops using the filter if
(curbatch != 0). We don't throw it away, though, because it also
collects some additional instrumentation that is shown by EXPLAIN ANALYZE.

>> I'm not sure it would make sense to then build smaller bloom
>> filters for individual batches, but maybe it would?
>
> I doubt it.
>

I think it might help if the global bloom filter ended up having a high
false positive rate. But only if the per-batch filters fit into the CPU
cache (i.e. the same reasoning as in the single-batch case).

But those "per-batch" filters are rather incompatible with pushing the
filter to scan nodes, I think.

>> Yeah, I admit those are rather crude rules.
>
> You have to start somewhere.
>
>> The trouble is that when we start with a single batch and then find
>> out the estimate was wrong and we need to start batching, all bets
>> are off. At that point it seems reasonable to just say "Here is X
>> MBs of RAM, do what you can".
>
> As I said above, I wouldn't say all bets are off when this happens
> -- not at all. Estimates are likely to often be somewhat wrong. If
> they're completely wrong, we can probably swallow the cost of giving
> up on a Bloom filter relatively late.
>
> As I said, X should not be a portion of work_mem, because that has
> only a weak relationship to what really matters.
>

I agree a fixed fraction of work_mem may not be the right thing, but the
goal was to make the bloom filter part of the Hash memory budget, i.e.

bloom filter + hash table <= work_mem

(which I think we agree should be the case), without increasing the
number of batches too much. For example, if you size the filter while
ignoring this constraint and it ends up being 90% of work_mem, you may
need to do the hash join in 128 batches instead of just 16. Or something
like that.

Maybe that would still be a win, though. Firstly, the higher number of
batches may not have a huge impact - in one case we need to serialize
15/16 of the data and in the other 127/128. That's 93% vs. 99%. And if
the more accurate filter allows us to discard much more data from the
outer relation ...
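To make that concrete, a quick back-of-the-envelope calculation (plain
arithmetic with made-up sizes, not code from the patch, just assuming
the usual power-of-two batch counts) showing how the share of work_mem
given to the filter translates into batch counts and the fraction of
data written to batch files:

#include <stdio.h>

static int
next_power_of_two(int n)
{
	int			p = 1;

	while (p < n)
		p *= 2;
	return p;
}

int
main(void)
{
	double		work_mem_kb = 4096.0;	/* 4MB work_mem */
	double		inner_kb = 60000.0;		/* ~60MB of inner tuples */
	double		bloom_shares[] = {0.0, 0.1, 0.9};

	for (int i = 0; i < 3; i++)
	{
		double		avail = work_mem_kb * (1.0 - bloom_shares[i]);
		/* crude ceiling of inner size / memory available per batch */
		int			nbatch = next_power_of_two((int) (inner_kb / avail + 0.999));

		printf("bloom=%2.0f%% of work_mem -> %3d batches, %.1f%% of data serialized\n",
			   bloom_shares[i] * 100, nbatch,
			   100.0 * (nbatch - 1) / nbatch);
	}
	return 0;
}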

>>> You should try to exploit the fact that a Bloom filter can summarize a
>>> large set reasonably well with a very compact, simple representation.
>>> A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
>>> for cases where Bloom probes will save a lot of work, it probably
>>> doesn't make all that much difference -- the hash join is still much
>>> faster.
>
>> But the problem is that I don't know what is the total size of the
>> hash table, because we're building the bloom filter for all the
>> batches at once. And we don't know how many batches will be there -
>> if we knew that, we could estimate the number of distinct values
>> and we could use that to size the filter instead of doing this.
>> (All of this only applies to the state where we start with a single
>> batch and then find out we need to start batching, of course.)
>
> I don't think that the number of batches should matter that much to
> the costing/sizing/estimation logic, even if it's an interesting
> thing to talk about when considering the upsides of a Bloom filter.
> My sense is that we should focus on making sure that using a Bloom
> filter is going to win, and not worry so much about whether that's
> going to be a huge win or a modest win.
>
> Suppose that you thought you were going to have a 10% false positive
> rate with a 22.85MB Bloom filter for 40 million elements (my earlier
> example). Further suppose that it turns out to be 80 million
> elements. This means that you're going to get a false positive rate
> of 30%. This could still be a big win, though, so it's not really a
> bad situation. With 120 million elements, it's about 45%, but even
> then it could still work out, especially because the hash join will
> be very slow anyway. You also have to bear in mind that the 40
> million estimate is much more likely to be too high than too low,
> because you're assuming distinct key values for the hash table.
>
> You're taking a risk, in a sense, but a lot of things have to go
> wrong for you to lose, and even then you can cut your losses before
> the extra cost is too high.
>
> Do you have a test query for this patch, that gives us some idea of
> the upside?
>

I have to admit I've been using only some simplistic ad-hoc queries.
There was a more detailed analysis in the old thread, but I'm not sure
how relevant it still is.

So I did some measurements on a simple join, with different work_mem
values and join selectivities.

select count(*)
from fact join dim on (dim.a = fact.a and dim.b < :sel)

Attached are results for the "small" data set (20M:1M); the full results
and scripts are available here:

https://github.com/tvondra/hashjoin-bloom-tests

The first list shows a summary of the results, with timings for

a) master
b) patched master (with bloom filters disabled)
c) patched master (with bloom filters used always)

The last two tables compare b/a and c/a. The first one shows that
there's 0-3% overhead when bloom filters are not used (but it might
easily be just noise or differences in the layout of the binary).

The second one is the more interesting one. It shows a couple of things:

a) For tiny hash tables there's about 11% overhead. 1% selectivity means
the hash table has only 10000 entries, which fits into ~483kB. This is
why I think we need a rule that skips the bloom filter for small hash
tables.

b) For low selectivity (70% or more rows get into the hash table), the
bloom filter is a net loss, costing up to ~11%. This is why we should
consider the selectivity of the join, I think (a possible heuristic
combining (a) and (b) is sketched below).

c) For selectivity between 10% and 70% it's a net win, with speedups
between ~10% (single batch) and ~20% (multi-batch).

Those are relatively modest improvements; I expect more significant
gains on the large data set.
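For the record, the kind of heuristic I have in mind based on (a) and
(b) might look roughly like this - the thresholds are just guesses read
off these measurements, and none of the names are from the patch:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hedged sketch of a possible "when to build the filter" heuristic:
 * skip the Bloom filter when the hash table is tiny (cheap to probe
 * directly) or when the join is expected to keep most of the outer
 * rows anyway.  Thresholds are illustrative guesses.
 */
#define BLOOM_MIN_HASHTABLE_BYTES	(1024 * 1024)	/* skip below ~1MB */
#define BLOOM_MAX_SELECTIVITY		0.70			/* skip above ~70% */

static bool
bloom_filter_useful(uint64_t hashtable_bytes, double est_fraction_matching)
{
	if (hashtable_bytes < BLOOM_MIN_HASHTABLE_BYTES)
		return false;			/* case (a): tiny hash table */

	if (est_fraction_matching > BLOOM_MAX_SELECTIVITY)
		return false;			/* case (b): filter rejects too little */

	return true;				/* case (c): expected net win */
}

int
main(void)
{
	printf("483kB table, 1%% sel: %d\n", bloom_filter_useful(483 * 1024, 0.01));
	printf("64MB table, 50%% sel: %d\n", bloom_filter_useful(64ULL << 20, 0.50));
	printf("64MB table, 90%% sel: %d\n", bloom_filter_useful(64ULL << 20, 0.90));
	return 0;
}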

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
summary.ods application/vnd.oasis.opendocument.spreadsheet 55.0 KB
