Re: Bloom filter Pushdown Optimization for Merge Join

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Zheng Li <zhengli10(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Bloom filter Pushdown Optimization for Merge Join
Date: 2022-10-03 16:14:13
Message-ID: 7a7d2271-7dc7-cb3a-a095-07bc1267d1b9@enterprisedb.com
Lists: pgsql-hackers

Hello Zheng Li,

Great to see someone is working on this! Some initial comments/review:

On 10/1/22 00:44, Zheng Li wrote:
> Hello,
>
> A bloom filter provides early filtering of rows that cannot be joined
> before they would reach the join operator; this optimization is also
> called a semi-join filter (SJF) pushdown. Such a filter can be created
> when one child of the join operator must materialize its derived table
> before the other child is evaluated.
>
> For example, a bloom filter can be created using the join keys for
> the build/inner side of a hash join or the outer side of a merge
> join; the bloom filter can then be used to pre-filter rows on the
> other side of the join operator during the scan of the base relation.
> The thread “Hash Joins vs. Bloom Filters / take 2” [1] is a good
> discussion of using this optimization for hash join, though without
> pushing down the filter, which could further increase its performance
> gain.
>

Agreed. That patch was beneficial for hashjoins with batching, but I
think the pushdown makes this much more interesting.

> We worked on prototyping bloom filter pushdown for both hash join and
> merge join. Attached is a patch set for bloom filter pushdown for
> merge join. We also plan to send the patch for hash join once we have
> it rebased.
>
> Here is a summary of the patch set:
> 1. Bloom Filter Pushdown optimizes Merge Join by filtering rows early
> during the table scan instead of later on.
> -The bloom filter is pushed down along the execution tree to
> the target SeqScan nodes.
> -Experiments show that this optimization can speed up Merge
> Join by up to 36%.
>

Right, although I think the speedup very much depends on the data sets
used for the tests, and can be made arbitrarily large with an
"appropriate" data set.

> 2. The planner makes the decision to use the bloom filter based on the
> estimated filtering rate and the expected performance gain.
> -The planner accomplishes this by estimating four numbers per
> variable - the total number of rows of the relation, the number of
> distinct values for a given variable, and the minimum and maximum
> value of the variable (when applicable). Using these numbers, the
> planner estimates a filtering rate of a potential filter.
> -Because creating and applying the filter adds extra
> operations, there is a minimum filtering rate below which the
> filter would not actually be useful. Based on testing, we check
> whether the estimated filtering rate is higher than 35%, and that
> informs our decision to use a filter or not.
>

I agree, in principle, although I think the current logic / formula is a
bit too crude and fitted to the simple data used in the tests. I think
this needs to be formulated as a regular costing problem, considering
things like the cost of the hash functions, and so on.

I think this needs to do two things:

1) estimate the cost of building the bloom filter - this will depend on
the number of rows in the inner relation, the number/cost of the hash
functions (which may be higher for some data types), etc.

2) estimate the improvement for the probing branch - essentially, we
need to estimate how much we save by filtering some of the rows, but
this also needs to include the cost of probing the bloom filter.
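
To make the shape of this concrete, here is a minimal sketch of the
two-part decision. All the names and the cost breakdown are mine,
invented for illustration - the real costing would need to look up
per-type hash function costs, etc.:

typedef double Cost;    /* stand-in for the planner's Cost typedef */

/*
 * Sketch: use the filter only when the estimated savings on the
 * probing side exceed the cost of building and probing the filter.
 */
static bool
semijoin_filter_pays_off(double inner_rows, double outer_rows,
                         double filtering_rate, int num_hashes,
                         Cost hash_cost, Cost per_row_savings)
{
    /* 1) build: hash every inner row into the filter */
    Cost        build_cost = inner_rows * num_hashes * hash_cost;

    /* 2) probe: every outer row pays for a filter lookup ... */
    Cost        probe_cost = outer_rows * num_hashes * hash_cost;

    /* ... but each row the filter rejects skips the join work */
    Cost        savings = outer_rows * filtering_rate * per_row_savings;

    return savings > build_cost + probe_cost;
}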

This will probably require some improvements to lib/bloomfilter, in
order to estimate the false positive rate - this may matter a lot for
large data sets and small work_mem values. The bloomfilter library
simply reduces the size of the bloom filter, which increases the false
positive rate. At some point that will start eating into the benefit.
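
For reference, the textbook estimate for a filter with m bits, k hash
functions and n inserted elements is (1 - e^(-k*n/m))^k. A helper along
these lines - the name is hypothetical, lib/bloomfilter exposes nothing
like it today - would let the costing see how shrinking the filter under
work_mem pressure inflates the error:

#include <math.h>

/* standard estimate: fpr ~= (1 - e^(-k*n/m))^k */
static double
bloom_estimate_fpr(double m_bits, int k_hashes, double n_elements)
{
    return pow(1.0 - exp(-(double) k_hashes * n_elements / m_bits),
               (double) k_hashes);
}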

> 3. If using a bloom filter, the planner also adjusts the expected cost
> of Merge Join based on expected performance gain.
>

I think this is going to be a weak point of the costing, because we're
adjusting the cost of the whole subtree after it was costed.

We're doing something similar when costing LIMIT, and that already
causes a lot of strange behavior with non-uniform data distributions, etc.

And in this case it's probably worse, because we're eliminating rows at
the scan level, without changing the cost of any of the intermediate
nodes. It's certainly going to be confusing in EXPLAIN, because of the
discrepancy between estimated and actual row counts ...

> 4. Capability to build the bloom filter in parallel in case of
> parallel SeqScan. This is done efficiently by populating a local bloom
> filter for each parallel worker and then taking a bitwise OR over all
> the local bloom filters to form a shared bloom filter at the end of
> the parallel SeqScan.
>

OK. Could we also build the bloom filter in shared memory?
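
FWIW the OR-merge step itself is trivial - a sketch, assuming the bitset
were exposed as an array of words (lib/bloomfilter keeps it private
today, so the struct is hypothetical), and assuming all workers use the
same filter size and hash seeds:

#include <stddef.h>
#include <stdint.h>

/* hypothetical exposed view of a filter's bitset */
typedef struct bloom_bits
{
    size_t      nwords;     /* bitset length, in 64-bit words */
    uint64_t   *bitset;
} bloom_bits;

/* OR a worker-local filter into the shared filter */
static void
bloom_merge(bloom_bits *shared, const bloom_bits *local)
{
    size_t      i;

    for (i = 0; i < shared->nwords; i++)
        shared->bitset[i] |= local->bitset[i];
}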

> 5. The optimization is GUC controlled, with settings of
> enable_mergejoin_semijoin_filter and force_mergejoin_semijoin_filter.
>
> We found in experiments that there is a significant improvement
> when using the bloom filter during Merge Join. One experiment involved
> joining two large tables while varying the theoretical filtering rate
> (TFR) between the two tables; the TFR is defined as the percentage of
> the two datasets that is disjoint. Both tables in the merge join were
> the same size. We varied the TFR to see how it affects the filtering
> optimization.
>
> For example, let’s imagine t0 has 10 million rows, which contain the
> numbers 1 through 10 million randomly shuffled. Also, t1 has the
> numbers 4 million through 14 million randomly shuffled. Then the TFR
> for a join of these two tables is 40%, since 40% of each table is
> disjoint from the other table (1 through 4 million for t0, 10 million
> through 14 million for t1).
>
> Here is the performance test result joining two tables:
> TFR: theoretical filtering rate
> EFR: estimated filtering rate
> AFR: actual filtering rate
> HJ: hash join
> MJ Default: default merge join
> MJ Filter: merge join with bloom filter optimization enabled
> MJ Filter Forced: merge join with bloom filter optimization forced
>
> TFR     EFR     AFR     HJ      MJ Default   MJ Filter   MJ Filter Forced
> --------------------------------------------------------------------------
>  10    33.46    7.41   6529        22638       21949              23160
>  20    37.27   14.85   6483        22290       21928              21930
>  30    41.32   22.25   6395        22374       20718              20794
>  40    45.67   29.70   6272        21969       19449              19410
>  50    50.41   37.10   6210        21412       18222              18224
>  60    55.64   44.51   6052        21108       17060              17018
>  70    61.59   51.98   5947        21020       15682              15737
>  80    68.64   59.36   5761        20812       14411              14437
>  90    77.83   66.86   5701        20585       13171              13200
> Table. Execution Time (ms) vs Filtering Rate (%) for Joining Two
> Tables of 10M Rows.
>
> Attached you can find figures of the same performance test and a SQL script
> to reproduce the performance test.
>
> The first thing to notice is that Hash Join generally is the most
> efficient join strategy. This is because Hash Join is better at
> dealing with small tables, and our size of 10 million rows is still
> small enough that Hash Join outperforms the other join strategies.
> Future experiments can investigate using much larger tables.
>
> However, comparing just within the different Merge Join variants, we
> see that using the bloom filter greatly improves performance.
> Intuitively, all of these execution times follow linear trends.
> Comparing forced filtering versus default, we can see that the default
> Merge Join outperforms Merge Join with filtering at low filter rates,
> but after about 20% TFR, the Merge Join with filtering outperforms
> default Merge Join. This makes intuitive sense, as there are some
> fixed costs associated with building and checking with the bloom
> filter. In the worst case, at only 10% TFR, the bloom filter makes
> Merge Join less than 5% slower. However, in the best case, at 90% TFR,
> the bloom filter improves Merge Join by 36%.
>
> Based on the results of the above experiments, we came up with a
> linear equation that predicts the performance ratio of using the
> filter pushdown from the filtering rate. Based on the numbers
> presented in the figure, this is the equation:
>
> T_filter / T_no_filter = 1 / (0.83 * estimated filtering rate + 0.863)
>
> For example, this means that with an estimated filtering rate of 0.4,
> the execution time of merge join is estimated to improve by 16.3%.
> Note that the estimated filtering rate is used in the equation, not
> the theoretical filtering rate or the actual filtering rate, because
> it is what we have during planning. In practice the estimated
> filtering rate isn’t usually accurate. In fact, the estimated
> filtering rate can differ from the theoretical filtering rate by as
> much as 17% in our experiments. One way to mitigate the power loss of
> the bloom filter caused by an inaccurate estimated filtering rate is
> to adaptively turn it off at execution time; this is yet to be
> implemented.
>

IMHO we shouldn't draw too many conclusions from these examples. Yes,
they show merge join can be improved, but only in cases where a hash
join works better anyway, so we wouldn't have used merge join in the
first place.

I think we should try constructing examples where either merge join wins
already (and gets further improved by the bloom filter), or would lose
to hash join and the bloom filter improves it enough to win.

AFAICS that requires a join of two large tables - large enough that hash
join would need to be batched, or pre-sorted inputs (which eliminates
the explicit Sort, which is the main cost in most cases).

The current patch only works with sequential scans, which eliminates the
second (pre-sorted) option. So let's try the first one - can we invent
an example with a join of two large tables where a merge join would win?

Can we find such an example in existing benchmarks like TPC-H/TPC-DS?

> Here is a list of tasks we plan to work on in order to improve this patch:
> 1. More regression testing to guarantee correctness.
> 2. More performance testing involving larger tables and complicated query plans.
> 3. Improve the cost model.

+1

> 4. Explore runtime tuning such as making the bloom filter checking adaptive.

I think this is tricky; I'd leave it out of the patch for now, until
the other bits are polished. It can be added later.

> 5. Currently, only the best single join key is used for building the
> Bloom filter. However, if there are several keys and we know that
> their distributions are somewhat disjoint, we could leverage this fact
> and use multiple keys for the bloom filter.

True, and I guess it wouldn't be hard.

> 6. Currently, Bloom filter pushdown is only implemented for SeqScan
> nodes. However, it would be possible to allow push down to other types
> of scan nodes.

I think pushing down the bloom filter to other types of scans is not the
hard part, really. It's populating the bloom filter early enough.

Invariably, all the examples end up with plans like this:

-> Merge Join
Merge Cond: (t0.c1 = t1.c1)
SemiJoin Filter Created Based on: (t0.c1 = t1.c1)
SemiJoin Estimated Filtering Rate: 1.0000
-> Sort
Sort Key: t0.c1
-> Seq Scan on t0
-> Materialize
-> Sort
Sort Key: t1.c1
-> Seq Scan on t1

The bloom filter is built by the first seqscan (on t0), and then used by
the second seqscan (on t1). But this only works because we always run
the t0 scan to completion (because we're feeding it into Sort) before we
start scanning t1.

But when the scan on t1 switches to an index scan, it's over - we'd be
building the filter without being able to probe it, and when we finish
building it we no longer need it. So this seems pretty futile.

It might still improve plans like

-> Merge Join
Merge Cond: (t0.c1 = t1.c1)
SemiJoin Filter Created Based on: (t0.c1 = t1.c1)
SemiJoin Estimated Filtering Rate: 1.0000
-> Sort
Sort Key: t0.c1
-> Seq Scan on t0
-> Index Scan on t1

But I don't know how common/likely that actually is. I'd expect to have
an index on both sides, but perhaps I'm wrong.

This is why hashjoin seems like a more natural fit for the bloom filter,
BTW, because there we have a guarantee the inner relation is processed
first (so we know the bloom filter is fine and can be probed).

> 7. Explore if the Bloom filter could be pushed down through a foreign
> scan when the foreign server is capable of handling it – which could
> be made true for postgres_fdw.

Neat idea, but I suggest leaving this out of scope for this patch.

> 8. Better explain command on the usage of bloom filters.
>

I don't know what improvements you have in mind exactly, but I think
it'd be good to show which node is building/using a bloom filter, and
then also some basic stats (size, number of hash functions, FPR, number
of probes, ...). This may require improvements to lib/bloomfilter, which
currently does not expose some of the details.
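
Purely as an illustration of what I mean (format and numbers invented),
something like this attached to the probing node in EXPLAIN ANALYZE:

   -> Seq Scan on t1
        SemiJoin Filter: size=1MB, hash functions=2, FPR=0.0100
        SemiJoin Probes: 10000000, Filtered: 3998322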

> This patch set is prepared by Marcus Ma, Lyu Pan and myself. Feedback
> is appreciated.
>

Attached is a patch series with two "review" parts (0002 and 0004). I
already mentioned some of the stuff above, but a couple more points:

1) Don't allocate memory directly through alloca() etc. Use palloc(),
i.e. rely on our memory contexts.

2) It's customary to have "PlannerInfo *root" as the first parameter.

3) For the "debug" logging, I'd suggest to do it the way TRACE_SORT
(instead of inventing a bunch of dbg routines).

4) I find the naming inconsistent, e.g. with respect to the surrounding
code (say, when everything around starts with Exec, maybe the new
functions should too?). Also, various functions/variables say "semijoin"
but then we apply that to "inner joins" too.

5) Do we really need estimate_distincts_remaining() to implement yet
another formula for estimating the number of distinct groups, different
from what estimate_num_groups() does? Why?

6) A number of new functions miss comments explaining the purpose, and
it's not quite clear what the "contract" is. Also, some functions have
new parameters but the comment was not updated to reflect it.

7) SemiJoinFilterExamineSlot is matching the relations by OID, but
that's wrong - if you do a self-join, both sides have the same OID. It
needs to match the RT index (I believe scanrelid in the Scan node is
what this should be looking at) - see the second sketch below.
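
To illustrate (3), this is roughly the TRACE_SORT pattern from
tuplesort.c, with all the semijoin-specific names invented here:

#include "postgres.h"

/* hypothetical GUC, analogous to trace_sort */
extern bool trace_semijoin_filter;

static void
semijoin_filter_log_stats(Size filter_bits, int num_hashes,
                          double prop_bits_set)
{
#ifdef TRACE_SEMIJOIN_FILTER        /* hypothetical symbol */
    if (trace_semijoin_filter)
        elog(LOG, "semijoin filter: %zu bits, %d hash functions, %.1f%% bits set",
             filter_bits, num_hashes, 100.0 * prop_bits_set);
#endif
}

And for (7), the gist of the fix - match scans by range-table index,
which stays unique even in a self-join (the filter struct below is a
guess, not the patch's actual one):

#include "postgres.h"
#include "nodes/plannodes.h"    /* Scan, with its scanrelid field */

/* hypothetical filter state */
typedef struct SemiJoinFilterSketch
{
    Index       target_rti;     /* RT index of the scan to filter */
} SemiJoinFilterSketch;

static bool
filter_matches_scan(SemiJoinFilterSketch *filter, Scan *scan)
{
    /* not by relation OID - a self-join's two scans share the OID */
    return filter->target_rti == scan->scanrelid;
}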

There are a couple more review comments in the patches, but those are
minor and not worth discussing here - feel free to ask, if anything is
not clear enough (or if you disagree).

I did a bunch of testing, after tweaking your SQL script.

I changed the data generation a bit so it's not so slow (instead of
relying on unnest of multiple large sets, I use one sequence and
random() to generate the data). And I ran the tests with different
parameters (step, work_mem, ...) driven by the attached shell script.

And it quickly fails (on an assert-enabled build). I see two backtraces:

1) bogus overlapping estimate (ratio > 1.0)
...
#4 0x0000000000c9d56b in ExceptionalCondition (conditionName=0xe43724
"inner_overlapping_ratio >= 0 && inner_overlapping_ratio <= 1",
errorType=0xd33069 "FailedAssertion", fileName=0xe42bdb "costsize.c",
lineNumber=7442) at assert.c:69
#5 0x00000000008ed767 in evaluate_semijoin_filtering_rate
(join_path=0x2fe79f0, equijoin_list=0x2fea6c0, root=0x2fe6b68,
workspace=0x7ffd87588a78, best_clause=0x7ffd875888cc,
rows_filtered=0x7ffd875888c8) at costsize.c:7442

Seems it's doing the math wrong, or does not expect some corner case.

2) stuck spinlock in SemiJoinFilterFinishScan
...
#5 0x0000000000a85cb0 in s_lock_stuck (file=0xe68c8c "lwlock.c",
line=907, func=0xe690a1 "LWLockWaitListLock") at s_lock.c:83
#6 0x0000000000a85a8d in perform_spin_delay (status=0x7ffd8758b8e8) at
s_lock.c:134
#7 0x0000000000a771c3 in LWLockWaitListLock (lock=0x7e40a597c060) at
lwlock.c:911
#8 0x0000000000a76e93 in LWLockConflictsWithVar (lock=0x7e40a597c060,
valptr=0x7e40a597c048, oldval=1, newval=0x7e40a597c048,
result=0x7ffd8758b983) at lwlock.c:1580
#9 0x0000000000a76ce9 in LWLockWaitForVar (lock=0x7e40a597c060,
valptr=0x7e40a597c048, oldval=1, newval=0x7e40a597c048) at lwlock.c:1638
#10 0x000000000080aa55 in SemiJoinFilterFinishScan
(semiJoinFilters=0x2e349b0, tableId=1253696, parallel_area=0x2e17388) at
nodeMergejoin.c:2035

This only happens in parallel plans, I haven't looked at the details.

I do recall parallel hash join was quite tricky exactly because there
are issues with coordinating the building of the hash table (e.g.
workers might get stuck waiting on shmem queues etc.). I wonder if this
might be something similar due to building the filter, but maybe it's
something trivial.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
0001-Support-semijoin-filter-in-the-planner-opti-20221003.patch text/x-patch 76.3 KB
0002-review-20221003.patch text/x-patch 10.0 KB
0003-Support-semijoin-filter-in-the-executor-for-20221003.patch text/x-patch 14.8 KB
0004-review-20221003.patch text/x-patch 4.7 KB
0005-Support-semijoin-filter-in-the-executor-for-20221003.patch text/x-patch 25.8 KB
0006-Integrate-EXPLAIN-command-with-semijoin-fil-20221003.patch text/x-patch 2.9 KB
0007-Add-basic-regress-tests-for-semijoin-filter-20221003.patch text/x-patch 8.7 KB
bloom-bt.txt text/plain 6.0 KB
run.sh application/x-shellscript 1007 bytes
script.template text/plain 7.3 KB
