Re: WIP: bloom filter in Hash Joins with batches

From: Oleg Bartunov <obartunov(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: bloom filter in Hash Joins with batches
Date: 2015-12-20 04:46:56
Message-ID: CAF4Au4ybPY74sv5r9k9b5FuG8+oTtp3kGVqHZEVO5XsNGbqSGA@mail.gmail.com
Lists: pgsql-hackers

Tomas,

have you seen
http://www.postgresql.org/message-id/4B4DD67F.9010506@sigaev.ru ?
I have a very limited internet connection (no graphics), so I may have
missed something.

Oleg

On Wed, Dec 16, 2015 at 4:15 AM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> Hi,
>
> while working on the Hash Join improvements, I've been repeatedly running
> into the idea of a bloom filter - various papers on hash joins mention
> bloom filters as a way to optimize access to the hash table by doing
> fewer lookups, etc.
>
> Sadly, I've been unable to see any real benefit from using a bloom
> filter, which I believe is mostly due to NTUP_PER_BUCKET=1 - that already
> makes the lookups quite efficient, so the room for bloom filter
> improvements is narrow.
>
> The one case where a bloom filter might still help is when the bloom
> filter fits into the L3 cache (a few MBs) while the hash table (or, more
> accurately, the buckets) does not. Then there's a chance that the
> multiple lookups into the small bloom filter are cheaper than a single
> probe into the cache-cold hash table.
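>
> To make the discussion concrete, here's a minimal sketch of the two bloom
> filter operations (illustrative only - the names and the double-hashing
> scheme are not taken from the patch):
>
>     #include <stdbool.h>
>     #include <stdint.h>
>
>     typedef struct BloomFilter
>     {
>         uint8_t    *bits;      /* bit array of nbits bits */
>         uint64_t    nbits;     /* number of bits in the array */
>         int         nhashes;   /* number of hash functions (k) */
>     } BloomFilter;
>
>     /* derive k probe positions from two hashes (Kirsch-Mitzenmacher) */
>     static void
>     bloom_add(BloomFilter *bf, uint64_t h1, uint64_t h2)
>     {
>         for (int i = 0; i < bf->nhashes; i++)
>         {
>             uint64_t    pos = (h1 + (uint64_t) i * h2) % bf->nbits;
>
>             bf->bits[pos / 8] |= (uint8_t) (1 << (pos % 8));
>         }
>     }
>
>     /* false means "definitely not present", true "probably present" */
>     static bool
>     bloom_test(const BloomFilter *bf, uint64_t h1, uint64_t h2)
>     {
>         for (int i = 0; i < bf->nhashes; i++)
>         {
>             uint64_t    pos = (h1 + (uint64_t) i * h2) % bf->nbits;
>
>             if ((bf->bits[pos / 8] & (1 << (pos % 8))) == 0)
>                 return false;
>         }
>         return true;
>     }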
>
> But I think there's another case where a bloom filter might be way more
> useful in Hash Join - when we do batching. What we currently do is
> simply:
>
> 1) build the batches for the hash table (inner relation)
>
> 2) read the outer relation (usually the larger one), and split it
> into batches just like the hash table
>
> 3) while doing (2) we join the first batch, and write the remaining
> batches to disk (temporary files)
>
> 4) we read the batches one by one (for both tables) and do the join
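>
> For reference, the routing in steps 1) and 2) just slices up the tuple's
> hash value - roughly what ExecHashGetBucketAndBatch() does (a simplified
> sketch, not the exact source):
>
>     #include <stdint.h>
>
>     /* nbuckets and nbatch are both powers of two */
>     static void
>     get_bucket_and_batch(uint32_t hashvalue,
>                          int nbuckets, int log2_nbuckets, int nbatch,
>                          int *bucketno, int *batchno)
>     {
>         /* low bits pick the bucket ... */
>         *bucketno = (int) (hashvalue & (nbuckets - 1));
>
>         /* ... the next bits pick the batch */
>         if (nbatch > 1)
>             *batchno = (int) ((hashvalue >> log2_nbuckets) & (nbatch - 1));
>         else
>             *batchno = 0;
>     }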
>
> Now, imagine that only some of the rows in the outer table actually match
> a row in the hash table. Currently we write even the non-matching rows
> into the temporary files, but with a bloom filter built on the whole hash
> table (all the batches at once) we can skip that for some join types.
>
> For an inner join we can immediately discard the outer row, and for a
> left join we can immediately output the row (null-extended). In both
> cases we completely eliminate the overhead of writing the tuple to the
> temporary file and then reading it again.
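>
> A sketch of that decision, reusing bloom_test() from the sketch above
> (the helper and its name are made up for illustration, not patch code):
>
>     typedef enum { JOIN_INNER, JOIN_LEFT } JoinType;
>
>     extern void emit_null_extended(void *tuple);    /* hypothetical */
>
>     /* returns true if the outer tuple still has to go to a temp file */
>     static bool
>     outer_tuple_needs_batching(const BloomFilter *bf, JoinType jointype,
>                                uint64_t h1, uint64_t h2, void *outer_tuple)
>     {
>         if (bloom_test(bf, h1, h2))
>             return true;        /* possible match - batch it as usual */
>
>         /* definite non-match: bloom filters have no false negatives */
>         if (jointype == JOIN_LEFT)
>             emit_null_extended(outer_tuple);    /* output immediately */
>
>         /* for both join types the tuple never touches a temp file */
>         return false;
>     }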
>
> The attached patch is a PoC of this approach - I'm pretty sure it's not
> perfectly correct (e.g. I've only tried it with inner joins), but it's
> good enough for demonstrating the benefits. It's rather incomplete (see
> the end of this e-mail), and I'm mostly soliciting some early feedback at
> this point.
>
> The numbers presented here are for a test case like this:
>
> CREATE TABLE dim (id INT, dval TEXT);
> CREATE TABLE fact (id INT, fval TEXT);
>
> INSERT INTO dim SELECT i, md5(i::text)
> FROM generate_series(1,10000000) s(i);
>
> -- repeat 10x
> INSERT INTO fact SELECT * FROM dim;
>
> and a query like this
>
> SELECT COUNT(fval) FROM fact JOIN dim USING (id) WHERE dval < 'a';
>
> with different values in the WHERE condition to select a fraction of the
> inner 'dim' table - this directly affects what portion of the 'fact' table
> has a matching row, and also the size of the hash table (and number of
> batches).
>
> Now, some numbers from a machine with 8GB of RAM (the 'fact' table has
> ~6.5GB of data, so there's actually quite a bit of memory pressure, forcing
> the temp files to disk etc.).
>
> With work_mem=16MB, it looks like this:
>
> batches  filter  select.   bloom (ms)  master (ms)  bloom/master
> -----------------------------------------------------------------
>       4       1    6.25%        23871        48631        49.09%
>       8       2   12.50%        25752        56692        45.42%
>       8       3   18.75%        31273        57455        54.43%
>      16       4   25.01%        37430        62325        60.06%
>      16       5   31.25%        39005        61143        63.79%
>      16       6   37.50%        46157        63533        72.65%
>      16       7   43.75%        53500        65483        81.70%
>      32       8   49.99%        53952        65730        82.08%
>      32       9   56.23%        55187        67521        81.73%
>      32       a   62.49%        64454        69448        92.81%
>      32       b   68.73%        66937        71692        93.37%
>      32       c   74.97%        73323        72060       101.75%
>      32       d   81.23%        76703        73513       104.34%
>      32       e   87.48%        81970        74890       109.45%
>      32       f   93.74%        86102        76257       112.91%
>
> Here 'batches' is the number of batches used for the join, 'filter' is
> the value used in the WHERE condition, and 'select.' is the selectivity,
> i.e. the fraction of the 'dim' table that matches the condition (and
> therefore of the 'fact' table too). 'bloom' and 'master' are timings of
> the query in milliseconds, and 'bloom/master' compares the runtimes - so
> for example 49% means the hash join with the bloom filter ran ~2x as
> fast.
>
> Admittedly, work_mem=16MB is quite low, but that's just a way to force
> batching. What really matters is the number of batches and selectivity (how
> many tuples we can eliminate using the bloom filter).
>
> For work_mem=64MB it looks like this:
>
> batches  filter  select.   bloom (ms)  master (ms)  bloom/master
> -----------------------------------------------------------------
>       1       1    6.25%        24846        23854       104.16%
>       2       2   12.50%        24369        45672        53.36%
>       2       3   18.75%        30432        47454        64.13%
>       4       4   25.01%        36175        59741        60.55%
>       4       5   31.25%        43103        62631        68.82%
>       4       6   37.50%        48179        64079        75.19%
>
> So with a single batch it's a bit slower (no batching happens in that
> case, but the code is a bit silly and still performs some bloom filter
> checks even though no filter was built). But once we start batching it's
> again ~2x as fast, and the advantage then slowly degrades as the
> selectivity increases.
>
> Attached is a spreadsheet with results for various work_mem values, and
> also for a smaller data set (just 30M rows in the fact table) that easily
> fits into memory. It shows similar gains, shaving off ~40% in the best
> case, suggesting this is not just thanks to a reduction of I/O when the
> temp files are forced to disk.
>
> As I mentioned, the patch is incomplete in several ways:
>
> 1) It does not count the bloom filter (which may be quite big) into
> work_mem properly.
>
> 2) It probably does not work for outer joins at this point.
>
> 3) Currently the bloom filter is used whenever we do batching, but it
> should really be driven by selectivity too - it'd be good to estimate
> the fraction of 'fact' tuples having a match in the hash table, and
> skip the bloom filter if that's over ~60% or so. Also, maybe the code
> should count the matches at runtime, and disable the bloom filter once
> we reach some threshold (see the sketch below).
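>
> For the runtime part of 3), the bookkeeping might look something like
> this (again just a sketch - the sample size and the ~60% threshold are
> made-up numbers, not measured values):
>
>     #include <stdbool.h>
>     #include <stdint.h>
>
>     typedef struct BloomStats
>     {
>         uint64_t    nlookups;   /* outer tuples tested against filter */
>         uint64_t    nhits;      /* tuples the filter could not rule out */
>         bool        enabled;    /* is the filter still worth consulting? */
>     } BloomStats;
>
>     static void
>     maybe_disable_bloom(BloomStats *st)
>     {
>         /* wait until we have a meaningful sample */
>         if (!st->enabled || st->nlookups < 10000)
>             return;
>
>         /* if almost everything passes the filter, it only costs cycles */
>         if ((double) st->nhits / (double) st->nlookups > 0.60)
>             st->enabled = false;
>     }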
>
> But overall, this seems like a nice optimization opportunity for hash
> joins on large data sets, where batching is necessary.
>
> Ideas?
>
> --
> Tomas Vondra http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
