Re: Bloom filter Pushdown Optimization for Merge Join

From: Lyu Pan <lyu(dot)steve(dot)pan(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, zyu(at)yugabyte(dot)com
Cc: Zheng Li <zhengli10(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Bloom filter Pushdown Optimization for Merge Join
Date: 2022-10-12 22:36:00
Message-ID: CAFp_TFZ_vszcdTERDE22a2yvf0QCoL8f6A5XGaWRbbR1+9angg@mail.gmail.com
Lists: pgsql-hackers

Hello Zhihong Yu & Tomas Vondra,

Thank you so much for your review and feedback!

We made some updates based on previous feedback and attached the new
patch set. Due to time constraints, we didn't get to resolve all the
comments, and we'll continue to improve this patch.

> In this prototype, the cost model is based on an assumption that there is
> a linear relationship between the performance gain from using a semijoin
> filter and the estimated filtering rate:
> % improvement to Merge Join cost = 0.83 * estimated filtering rate - 0.137.
>
> How were the coefficients (0.83 and 0.137) determined ?
> I guess they were based on the results of running certain workload.

Right, the coefficients (0.83 and 0.137) were determined based on
some preliminary testing. The current costing model is pretty naive,
and we'll work on a more robust costing model in future work.
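
For concreteness, the current model amounts to little more than the
following (a rough sketch in plain C, not the actual patch code),
using the fitted ratio discussed later in this thread:

#include <stdio.h>

/*
 * Rough sketch (not the actual patch code) of the fitted cost model
 * as a multiplier on the merge join cost, using the formula from this
 * thread: T_filter / T_no_filter = 1 / (0.83 * rate + 0.863).
 */
static double
mergejoin_cost_multiplier(double est_filtering_rate)
{
    return 1.0 / (0.83 * est_filtering_rate + 0.863);
}

int
main(void)
{
    /* An estimated filtering rate of 0.4 gives ~0.837, i.e. a ~16.3%
     * estimated improvement, matching the example in this thread. */
    printf("%.3f\n", mergejoin_cost_multiplier(0.4));
    return 0;
}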

> I agree, in principle, although I think the current logic / formula is a
> bit too crude and fitted to the simple data used in the test. I think
> this needs to be formulated as a regular costing issue, considering
> stuff like cost of the hash functions, and so on.
>
> I think this needs to do two things:
>
> 1) estimate the cost of building the bloom filter - This shall depend on
> the number of rows in the inner relation, number/cost of the hash
> functions (which may be higher for some data types), etc.
>
> 2) estimate improvement for the probing branch - Essentially, we need to
> estimate how much we save by filtering some of the rows, but this also
> needs to include the cost of probing the bloom filter.
>
> This will probably require some improvements to lib/bloomfilter, in
> order to estimate the false positive rate - this may matter a lot for
> large data sets and small work_mem values. The bloomfilter library
> simply reduces the size of the bloom filter, which increases the false
> positive rate. At some point it'll start reducing the benefit.
>

These suggestions make a lot of sense. The current costing model is
definitely not good enough, and we plan to work on a more robust
costing model as we continue to improve the patch.
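
In particular, once lib/bloomfilter exposes its sizing parameters,
the false positive rate could be estimated with the standard
approximation. A self-contained sketch (the element counts and filter
sizes below are illustrative, not lib/bloomfilter code):

#include <math.h>
#include <stdio.h>

/*
 * Standard Bloom filter false-positive-rate approximation:
 * FPR ~= (1 - e^(-k*n/m))^k for m bits, k hash functions, and
 * n inserted elements.
 */
static double
bloom_fpr(double m_bits, double k_hashes, double n_elems)
{
    return pow(1.0 - exp(-k_hashes * n_elems / m_bits), k_hashes);
}

int
main(void)
{
    /* Shrinking the filter to fit a small work_mem raises the FPR:
     * 10M elements, 2 hash functions, 64MB vs. 8MB of filter bits. */
    printf("64MB: %.4f\n", bloom_fpr(64e6 * 8, 2, 10e6));
    printf(" 8MB: %.4f\n", bloom_fpr(8e6 * 8, 2, 10e6));
    return 0;
}

Shrinking the filter from 64MB to 8MB in this example raises the FPR
from roughly 0.15% to about 7%, which illustrates the work_mem effect
described above.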

> OK. Could also build the bloom filter in shared memory?
>

We considered this approach but decided against it: if all worker
processes shared the same bloom filter in shared memory, we would
need to lock and unlock the filter frequently to avoid race
conditions. So we decided to have each worker process build its own
local bloom filter.
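
The final merge is then a single pass at the end of the parallel
SeqScan. Roughly (a sketch assuming a flat word-array layout; the
actual patch works with lib/bloomfilter's internal representation):

#include <stddef.h>
#include <stdint.h>

/*
 * Each parallel worker fills a private filter during its portion of
 * the scan; the filters are combined with a bitwise OR at the end of
 * the parallel SeqScan.
 */
static void
merge_local_bloom_filter(uint64_t *shared, const uint64_t *local,
                         size_t nwords)
{
    /* OR-ing the bit arrays is exactly the set union of the filters,
     * so no per-insert locking is needed during the scan itself. */
    for (size_t i = 0; i < nwords; i++)
        shared[i] |= local[i];
}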

> IMHO we shouldn't make too many conclusions from these examples. Yes, it
> shows merge join can be improved, but only for cases where a hashjoin
> works better, so we wouldn't use merge join anyway.
>
> I think we should try constructing examples where either merge join wins
> already (and gets further improved by the bloom filter), or would lose
> to hash join and the bloom filter improves it enough to win.
>
> AFAICS that requires a join of two large tables - large enough that hash
> join would need to be batched, or pre-sorted inputs (which eliminates
> the explicit Sort, which is the main cost in most cases).
>
> The current patch only works with sequential scans, which eliminates the
> second (pre-sorted) option. So let's try the first one - can we invent
> an example with a join of two large tables where a merge join would win?
>
> Can we find such an example in existing benchmarks like TPC-H/TPC-DS?
>

Agreed. The current examples are only intended to show that using
bloom filters in merge join could improve merge join performance in
some cases. We are working on more examples where a merge join with a
bloom filter out-performs a hash join, which should be more
persuasive.

> The bloom filter is built by the first seqscan (on t0), and then used by
> the second seqscan (on t1). But this only works because we always run
> the t0 scan to completion (because we're feeding it into Sort) before we
> start scanning t1.
>
> But when the scan on t1 switches to an index scan, it's over - we'd be
> building the filter without being able to probe it, and when we finish
> building it we no longer need it. So this seems pretty futile.
>
> It might still improve plans like
>
> ->  Merge Join
>       Merge Cond: (t0.c1 = t1.c1)
>       SemiJoin Filter Created Based on: (t0.c1 = t1.c1)
>       SemiJoin Estimated Filtering Rate: 1.0000
>       ->  Sort
>             Sort Key: t0.c1
>             ->  Seq Scan on t0
>       ->  Index Scan on t1
>
> But I don't know how common/likely that actually is. I'd expect to have
> an index on both sides, but perhaps I'm wrong.
>
> This is why hashjoin seems like a more natural fit for the bloom filter,
> BTW, because there we have a guarantee the inner relation is processed
> first (so we know the bloom filter is fine and can be probed).
>

Great observation. The bloom filter only works if the first SeqScan
always runs to completion before the second SeqScan starts.
One possible way to avoid a futile bloom filter might be to enforce
that the filter is only used when both the outer and inner plans of
the MergeJoin are Sort nodes, which guarantees the filter is fully
built after one side of the join is processed, but this may be too
restrictive.
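
Something along these lines (a minimal sketch, with illustrative
stand-ins for the planner's node tags rather than the patch's actual
check):

#include <stdbool.h>

/* Illustrative stand-ins for the planner's node tags. */
typedef enum { PLAN_SORT, PLAN_SEQSCAN, PLAN_INDEXSCAN } PlanKind;

/*
 * Allow the semijoin filter only when both merge join children are
 * Sort nodes: the build-side Sort guarantees its scan runs to
 * completion before the probe side starts returning rows.
 */
static bool
semijoin_filter_applicable(PlanKind outer_child, PlanKind inner_child)
{
    return outer_child == PLAN_SORT && inner_child == PLAN_SORT;
}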

> I don't know what improvements you have in mind exactly, but I think
> it'd be good to show which node is building/using a bloom filter, and
> then also some basic stats (size, number of hash functions, FPR, number
> of probes, ...). This may require improvements to lib/bloomfilter, which
> currently does not expose some of the details.
>

Along with the new patch set, we have added information to display
which node is building/using a bloom filter (as well as the
corresponding expressions), along with some basic bloom filter stats.
We'll add more related information (e.g. FPR) as we modify the
lib/bloomfilter implementation in future work.
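
For instance, the stats could be emitted through the existing
ExplainProperty* helpers. A rough sketch (the labels and fields here
are illustrative, not the patch's actual output):

#include "postgres.h"
#include "commands/explain.h"

/*
 * Emit bloom filter stats in EXPLAIN output via the existing
 * ExplainProperty* helpers; labels and parameters are illustrative.
 */
static void
show_semijoin_filter_stats(int64 filter_size_bits, int64 num_hash_funcs,
                           int64 rows_probed, ExplainState *es)
{
    ExplainPropertyInteger("SemiJoin Filter Size", "bits",
                           filter_size_bits, es);
    ExplainPropertyInteger("SemiJoin Hash Functions", NULL,
                           num_hash_funcs, es);
    ExplainPropertyInteger("SemiJoin Rows Probed", NULL,
                           rows_probed, es);
}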

Thanks again for the great reviews; they are really useful! More
feedback is always welcome and appreciated!

Regards,
Lyu Pan
Amazon RDS/Aurora for PostgreSQL

On Mon, 3 Oct 2022 at 09:14, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
> Hello Zheng Li,
>
> Great to see someone is working on this! Some initial comments/review:
>
> On 10/1/22 00:44, Zheng Li wrote:
> > Hello,
> >
> > A bloom filter provides early filtering of rows that cannot be joined
> > before they would reach the join operator; the optimization is also
> > called a semi join filter (SJF) pushdown. Such a filter can be created
> > when one child of the join operator must materialize its derived table
> > before the other child is evaluated.
> >
> > For example, a bloom filter can be created using the join keys for
> > the build side/inner side of a hash join or the outer side of a merge
> > join; the bloom filter can then be used to pre-filter rows on the
> > other side of the join operator during the scan of the base relation.
> > The thread about “Hash Joins vs. Bloom Filters / take 2” [1] is a good
> > discussion on using such an optimization for hash join without going
> > into the pushdown of the filter, where its performance gain could be
> > further increased.
> >
>
> Agreed. That patch was beneficial for hashjoins with batching, but I
> think the pushdown makes this much more interesting.
>
> > We worked on prototyping bloom filter pushdown for both hash join and
> > merge join. Attached is a patch set for bloom filter pushdown for
> > merge join. We also plan to send the patch for hash join once we have
> > it rebased.
> >
> > Here is a summary of the patch set:
> > 1. Bloom Filter Pushdown optimizes Merge Join by filtering rows early
> >    during the table scan instead of later on.
> >    - The bloom filter is pushed down along the execution tree to the
> >      target SeqScan nodes.
> >    - Experiments show that this optimization can speed up Merge Join
> >      by up to 36%.
> >
>
> Right, although I think the speedup very much depends on the data sets
> used for the tests, and can be made arbitrarily large with an
> "appropriate" data set.
>
> > 2. The planner makes the decision to use the bloom filter based on the
> >    estimated filtering rate and the expected performance gain.
> >    - The planner accomplishes this by estimating four numbers per
> >      variable: the total number of rows of the relation, the number of
> >      distinct values of the variable, and the minimum and maximum
> >      value of the variable (when applicable). Using these numbers, the
> >      planner estimates the filtering rate of a potential filter.
> >    - Because actually creating and applying the filter adds more
> >      operations, there is a minimum threshold of filtering below which
> >      the filter would not be useful. Based on testing, we check whether
> >      the estimated filtering rate is higher than 35%, and that informs
> >      our decision to use a filter or not.
> >
>
> I agree, in principle, although I think the current logic / formula is a
> bit too crude and fitted to the simple data used in the test. I think
> this needs to be formulated as a regular costing issue, considering
> stuff like cost of the hash functions, and so on.
>
> I think this needs to do two things:
>
> 1) estimate the cost of building the bloom filter - This shall depend on
> the number of rows in the inner relation, number/cost of the hash
> functions (which may be higher for some data types), etc.
>
> 2) estimate improvement for the probing branch - Essentially, we need to
> estimate how much we save by filtering some of the rows, but this also
> needs to include the cost of probing the bloom filter.
>
> This will probably require some improvements to lib/bloomfilter, in
> order to estimate the false positive rate - this may matter a lot for
> large data sets and small work_mem values. The bloomfilter library
> simply reduces the size of the bloom filter, which increases the false
> positive rate. At some point it'll start reducing the benefit.
>
> > 3. If using a bloom filter, the planner also adjusts the expected cost
> > of Merge Join based on expected performance gain.
> >
>
> I think this is going to be a weak point of the costing, because we're
> adjusting the cost of the whole subtree after it was costed.
>
> We're doing something similar when costing LIMIT, and that can already
> cause a lot of strange stuff with non-uniform data distributions, etc.
>
> And in this case it's probably worse, because we're eliminating rows at
> the scan level, without changing the cost of any of the intermediate
> nodes. It's certainly going to be confusing in EXPLAIN, because of the
> discrepancy between estimated and actual row counts ...
>
> > 4. Capability to build the bloom filter in parallel in case of
> > parallel SeqScan. This is done efficiently by populating a local bloom
> > filter for each parallel worker and then taking a bitwise OR over all
> > the local bloom filters to form a shared bloom filter at the end of
> > the parallel SeqScan.
> >
>
> OK. Could also build the bloom filter in shared memory?
>
> > 5. The optimization is GUC controlled, with settings of
> > enable_mergejoin_semijoin_filter and force_mergejoin_semijoin_filter.
> >
> > We found in experiments that there is a significant improvement
> > when using the bloom filter during Merge Join. One experiment involved
> > joining two large tables while varying the theoretical filtering rate
> > (TFR) between the two tables; the TFR is defined as the percentage by
> > which the two datasets are disjoint. Both tables in the merge join were
> > the same size. We varied the TFR to observe the change in the
> > filtering optimization.
> >
> > For example, let’s imagine t0 has 10 million rows, which contain the
> > numbers 1 through 10 million randomly shuffled. Also, t1 has the
> > numbers 4 million through 14 million randomly shuffled. Then the TFR
> > for a join of these two tables is 40%, since 40% of each table is
> > disjoint from the other (1 through 4 million for t0, 10 million
> > through 14 million for t1).
> >
> > Here is the performance test result joining two tables:
> > TFR: theoretical filtering rate
> > EFR: estimated filtering rate
> > AFR: actual filtering rate
> > HJ: hash join
> > MJ Default: default merge join
> > MJ Filter: merge join with bloom filter optimization enabled
> > MJ Filter Forced: merge join with bloom filter optimization forced
> >
> > TFR  EFR    AFR    HJ    MJ Default  MJ Filter  MJ Filter Forced
> > -----------------------------------------------------------------
> > 10   33.46   7.41  6529  22638       21949      23160
> > 20   37.27  14.85  6483  22290       21928      21930
> > 30   41.32  22.25  6395  22374       20718      20794
> > 40   45.67  29.7   6272  21969       19449      19410
> > 50   50.41  37.1   6210  21412       18222      18224
> > 60   55.64  44.51  6052  21108       17060      17018
> > 70   61.59  51.98  5947  21020       15682      15737
> > 80   68.64  59.36  5761  20812       14411      14437
> > 90   77.83  66.86  5701  20585       13171      13200
> > Table. Execution Time (ms) vs Filtering Rate (%) for Joining Two
> > Tables of 10M Rows.
> >
> > Attached you can find figures of the same performance test and a SQL script
> > to reproduce the performance test.
> >
> > The first thing to notice is that Hash Join generally is the most
> > efficient join strategy. This is because Hash Join is better at
> > dealing with small tables, and our size of 10 million is still small
> > enough that Hash Join outperforms the other join strategies. Future
> > experiments can investigate using much larger tables.
> >
> > However, comparing just within the different Merge Join variants, we
> > see that using the bloom filter greatly improves performance.
> > Intuitively, all of these execution times follow linear trends.
> > Comparing forced filtering versus default, we can see that the default
> > Merge Join outperforms Merge Join with filtering at low filter rates,
> > but after about 20% TFR, the Merge Join with filtering outperforms
> > default Merge Join. This makes intuitive sense, as there are some
> > fixed costs associated with building and checking with the bloom
> > filter. In the worst case, at only 10% TFR, the bloom filter makes
> > Merge Join less than 5% slower. However, in the best case, at 90% TFR,
> > the bloom filter improves Merge Join by 36%.
> >
> > Based on the results of the above experiments, we came up with a
> > linear equation for the performance ratio of using the filter
> > pushdown as a function of the filtering rate. Based on the numbers
> > presented in the figure, this is the equation:
> >
> > T_filter / T_no_filter = 1 / (0.83 * estimated filtering rate + 0.863)
> >
> > For example, this means that with an estimated filtering rate of 0.4,
> > the execution time of merge join is estimated to be improved by 16.3%.
> > Note that the estimated filtering rate is used in the equation, not
> > the theoretical filtering rate or the actual filtering rate because it
> > is what we have during planning. In practice the estimated filtering
> > rate isn’t usually accurate. In fact, the estimated filtering rate can
> > differ from the theoretical filtering rate by as much as 17% in our
> > experiments. One way to mitigate the loss of bloom filter effectiveness
> > caused by an inaccurate estimated filtering rate is to adaptively turn
> > the filter off at execution time; this is yet to be implemented.
> >
>
> IMHO we shouldn't make too many conclusions from these examples. Yes, it
> shows merge join can be improved, but only for cases where a hashjoin
> works better, so we wouldn't use merge join anyway.
>
> I think we should try constructing examples where either merge join wins
> already (and gets further improved by the bloom filter), or would lose
> to hash join and the bloom filter improves it enough to win.
>
> AFAICS that requires a join of two large tables - large enough that hash
> join would need to be batched, or pre-sorted inputs (which eliminates
> the explicit Sort, which is the main cost in most cases).
>
> The current patch only works with sequential scans, which eliminates the
> second (pre-sorted) option. So let's try the first one - can we invent
> an example with a join of two large tables where a merge join would win?
>
> Can we find such an example in existing benchmarks like TPC-H/TPC-DS?
>
> > Here is a list of tasks we plan to work on in order to improve this patch:
> > 1. More regression testing to guarantee correctness.
> > 2. More performance testing involving larger tables and complicated query plans.
> > 3. Improve the cost model.
>
> +1
>
> > 4. Explore runtime tuning such as making the bloom filter checking adaptive.
>
> I think this is tricky; I'd leave it out of the patch for now until
> the other bits are polished. It can be added later.
>
> > 5. Currently, only the best single join key is used for building the
> > Bloom filter. However, if there are several keys and we know that
> > their distributions are somewhat disjoint, we could leverage this fact
> > and use multiple keys for the bloom filter.
>
> True, and I guess it wouldn't be hard.
>
> > 6. Currently, Bloom filter pushdown is only implemented for SeqScan
> > nodes. However, it would be possible to allow push down to other types
> > of scan nodes.
>
> I think pushing down the bloom filter to other types of scans is not the
> hard part, really. It's populating the bloom filter early enough.
>
> Invariably, all the examples end up with plans like this:
>
> ->  Merge Join
>       Merge Cond: (t0.c1 = t1.c1)
>       SemiJoin Filter Created Based on: (t0.c1 = t1.c1)
>       SemiJoin Estimated Filtering Rate: 1.0000
>       ->  Sort
>             Sort Key: t0.c1
>             ->  Seq Scan on t0
>       ->  Materialize
>             ->  Sort
>                   Sort Key: t1.c1
>                   ->  Seq Scan on t1
>
> The bloom filter is built by the first seqscan (on t0), and then used by
> the second seqscan (on t1). But this only works because we always run
> the t0 scan to completion (because we're feeding it into Sort) before we
> start scanning t1.
>
> But when the scan on t1 switches to an index scan, it's over - we'd be
> building the filter without being able to probe it, and when we finish
> building it we no longer need it. So this seems pretty futile.
>
> It might still improve plans like
>
> ->  Merge Join
>       Merge Cond: (t0.c1 = t1.c1)
>       SemiJoin Filter Created Based on: (t0.c1 = t1.c1)
>       SemiJoin Estimated Filtering Rate: 1.0000
>       ->  Sort
>             Sort Key: t0.c1
>             ->  Seq Scan on t0
>       ->  Index Scan on t1
>
> But I don't know how common/likely that actually is. I'd expect to have
> an index on both sides, but perhaps I'm wrong.
>
> This is why hashjoin seems like a more natural fit for the bloom filter,
> BTW, because there we have a guarantee the inner relation is processed
> first (so we know the bloom filter is fine and can be probed).
>
> > 7. Explore if the Bloom filter could be pushed down through a foreign
> > scan when the foreign server is capable of handling it – which could
> > be made true for postgres_fdw.
>
> Neat idea, but I suggest to leave this out of scope of this patch.
>
> > 8. Better explain command on the usage of bloom filters.
> >
>
> I don't know what improvements you have in mind exactly, but I think
> it'd be good to show which node is building/using a bloom filter, and
> then also some basic stats (size, number of hash functions, FPR, number
> of probes, ...). This may require improvements to lib/bloomfilter, which
> currently does not expose some of the details.
>
> > This patch set is prepared by Marcus Ma, Lyu Pan and myself. Feedback
> > is appreciated.
> >
>
> Attached is a patch series with two "review" parts (0002 and 0004). I
> already mentioned some of the stuff above, but a couple more points:
>
> 1) Don't allocate memory directly through alloca() etc. Use palloc, i.e.
> rely on our memory context.
>
> 2) It's customary to have "PlannerInfo *root" as the first parameter.
>
> 3) For the "debug" logging, I'd suggest doing it the way TRACE_SORT does
> (instead of inventing a bunch of dbg routines).
>
> 4) I find the naming inconsistent, e.g. with respect to the surrounding
> code (say, when everything around starts with Exec, maybe the new
> functions should too?). Also, various functions/variables say "semijoin"
> but then we apply that to "inner joins" too.
>
> 5) Do we really need estimate_distincts_remaining() to implement yet
> another formula for estimating the number of distinct groups, different
> from what estimate_num_groups() does? Why?
>
> 6) A number of new functions miss comments explaining the purpose, and
> it's not quite clear what the "contract" is. Also, some functions have
> new parameters but the comment was not updated to reflect it.
>
> 7) SemiJoinFilterExamineSlot is matching the relations by OID, but
> that's wrong - if you do a self-join, both sides have the same OID. It
> needs to match RT index (I believe scanrelid in Scan node is what this
> should be looking at).
>
> There are a couple more review comments in the patches, but those are
> minor and not worth discussing here - feel free to ask, if anything is
> not clear enough (or if you disagree).
>
>
> I did a bunch of testing, after tweaking your SQL script.
>
> I changed the data generation a bit not to be so slow (instead of
> relying on unnest of multiple large sets, I use one sequence and random
> to generate data). And I run the tests with different parameters (step,
> work_mem, ...) driven by the attached shell script.
>
> And it quickly fails (on an assert-enabled build). I see two backtraces:
>
> 1) bogus overlapping estimate (ratio > 1.0)
> ...
> #4 0x0000000000c9d56b in ExceptionalCondition (conditionName=0xe43724
> "inner_overlapping_ratio >= 0 && inner_overlapping_ratio <= 1",
> errorType=0xd33069 "FailedAssertion", fileName=0xe42bdb "costsize.c",
> lineNumber=7442) at assert.c:69
> #5 0x00000000008ed767 in evaluate_semijoin_filtering_rate
> (join_path=0x2fe79f0, equijoin_list=0x2fea6c0, root=0x2fe6b68,
> workspace=0x7ffd87588a78, best_clause=0x7ffd875888cc,
> rows_filtered=0x7ffd875888c8) at costsize.c:7442
>
> Seems it's doing the math wrong, or does not expect some corner case.
>
>
> 2) stuck spinlock in SemiJoinFilterFinishScan
> ...
> #5 0x0000000000a85cb0 in s_lock_stuck (file=0xe68c8c "lwlock.c",
> line=907, func=0xe690a1 "LWLockWaitListLock") at s_lock.c:83
> #6 0x0000000000a85a8d in perform_spin_delay (status=0x7ffd8758b8e8) at
> s_lock.c:134
> #7 0x0000000000a771c3 in LWLockWaitListLock (lock=0x7e40a597c060) at
> lwlock.c:911
> #8 0x0000000000a76e93 in LWLockConflictsWithVar (lock=0x7e40a597c060,
> valptr=0x7e40a597c048, oldval=1, newval=0x7e40a597c048,
> result=0x7ffd8758b983) at lwlock.c:1580
> #9 0x0000000000a76ce9 in LWLockWaitForVar (lock=0x7e40a597c060,
> valptr=0x7e40a597c048, oldval=1, newval=0x7e40a597c048) at lwlock.c:1638
> #10 0x000000000080aa55 in SemiJoinFilterFinishScan
> (semiJoinFilters=0x2e349b0, tableId=1253696, parallel_area=0x2e17388) at
> nodeMergejoin.c:2035
>
> This only happens in parallel plans, I haven't looked at the details.
>
> I do recall parallel hash join was quite tricky exactly because there
> are issues with coordinating building the hash table (e.g. workers
> might get stuck waiting on shmem queues etc.); I wonder if this might
> be something similar due to building the filter. But maybe it's
> something trivial.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Attachment Content-Type Size
v1-0001-Support-semijoin-filter-in-the-planner-optimizer.patch application/octet-stream 76.2 KB
v1-0002-Support-semijoin-filter-in-the-executor-for-non-para.patch application/octet-stream 14.0 KB
v1-0003-Support-semijoin-filter-in-the-executor-for-parallel.patch application/octet-stream 22.4 KB
v1-0004-Integrate-EXPLAIN-command-with-semijoin-filter.patch application/octet-stream 8.1 KB
v1-0005-Add-basic-regression-tests-for-semijoin-filter.patch application/octet-stream 23.6 KB
