| From: | Tomas Vondra <tomas(at)vondra(dot)me> |
|---|---|
| To: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: hashjoins vs. Bloom filters (yet again) |
| Date: | 2026-06-02 15:22:36 |
| Message-ID: | a4700550-474f-42d8-b9fc-7e3dd0bfa818@vondra.me |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi,
I kept thinking about the various issues discussed after I posted the v1
pshdown patch. Some of the issues are specific to the pushdown (to scan
nodes), but a lot of the issues seem to be shared with using Bloom
filters within the hashjoin (which is what the old threads were about).
We'd need to do something about these issues no matter where we place
the filter, so it's a bit of prerequisite for using Bloom in hash joins
in general. And they seem somewhat more limited / easier to solve than
the planning/costing issues.
So I decided it'd be interesting to see how beneficial can the Bloom
filters be in the scope of a single hashjoin, without pushing it all the
way to the scan nodes, and see what we can do about the issues.
Attached is a PoC patch series optinally adding Bloom filters to a hash
join, both for serial and parallel joins. It's labeled as v2, but it's
really independent of the v1 pushdoown patch posted last week. Some of
the ideas implemented in this could be applied to the pushdown patch too
(in particular all the adaptive behavior).
I'm not sure if we should try to merge these two things into a single
patch series, or whether it'd be better to split those into two threads
(otherwise it'll just keep confusing both people and cfbot).
how the patch works
-------------------
Anyway, let me briefly explain what the patch does (see the commit
messages and comments for more details, I tried to keep those
comprehensive). I suggest focusing on the serial case (in 0001), the
parallel joins are a direct extension of that - but inherently harder to
understand, due to the parallel hash build, shmem etc.
In principle, using Bloom filters is pretty simple - while adding tuples
from the inner relation to the hash table, build also a Bloom filter and
then use it to discard outer tuples cheaply, without having to do an
expensive lookup in a hash table. It does not depend if the hash table
is in private or shared memory.
The difficulty is to figure out whether it makes sense to build/probe
the filter. For that to be the case, the filter needs to eliminate
enough outer tuples, so that the hash table lookup is not needed, and/or
the tuple can be discarded without spilling it to disk (with nbatch>1).
Note: With the pushdown, the benefits "compound" by combining multiple
filters (if there are multiple joins) and/or by skipping some
intermediate operators (between the scan and the hashjoin). So it's
maybe less risky, but the issue still exists.
adaptive build / probing
------------------------
I see two complementary ways to deal with this - during planning (based
on estimates and a cost model), and adaptively during execution (based
on probe/lookup stats). The v2 patch does the latter, mostly because I
think it's beneficial even if we eventually add some smarts to the
planning phase.
The adaptive behavior decides (a) when a filter is built, and (b) if a
filter is probed before hash table lookups.
For builds, we don't want to build filters when ~100% of lookups in the
hash table find a match. It'd not pay for itself. So when the hash table
fits into memory (nbatch=1), we wait for the first 1000 lookups, and
only build the filter only if <90% have a match (and recheck once in a
while, so the filter may be built later).
But with batched joins (nbatch>1) we can't delay building the filter, we
have to decide before spilling some of the tuples to disk (otherwise the
filter would be incomplete, and we couldn't reject tuples from later
batches - which is the main benefit with batched joins). So with batched
joins we build the filter, and hope that either it helps, or the
overhead is negligible overall.
Then when probing, we don't want to use filter that does not reject any
tuples. To deal with this, the patch tracks number of probes and number
of rejections, and if fewer than 10% of probes reject the tuple (i.e.
the filter is ineffective), it gets temporarily "disabled". When
disabled, a filter samples 1% of probes, and then may get enabled again
if the fraction of rejected tuples gets >20%.
Overall, this seems to work pretty well. Of course, it can be improved
in various ways. For example, the thresholds 10% and 20% are somewhat
arbitrary - it's based on earlier experiments, and it works OK on a
number of machines, with different queries / data types. But having a
more formal "cost model" for Bloom filters might help.
Another possible improvement is about maybe doing some decisions during
planning, particularly when the decisions are reliable. I'm rather
skeptical about deciding to build a Bloom filter based on estimates. I
think it's better to do that decision during execution, as explained in
the preceding sections. We could still consider the "expected" Bloom
filter for costing purposed, but leave the decision for execution.
However, in some cases we may be able to know for sure a Bloom filter is
useless. For example, if we know a given join is on a FK, every outer
tuple will have a match. In that case the filter can't help. The patch
won't build it anyway (at least for nbatch=1), thanks to the adaptive
build heuristics. But we could short-circuit that entirely.
perf evaluation
---------------
Now, some numbers. Attached is a .tgz with benchmark script running a
hashjoin on two tables (fact-dimension), varying the selectivity of the
join (5%-100%), work_mem, number of parallel workers, data types of the
join keys, and size of the tables. There's a .csv with more complete
results of the tests, I'll focus on results for scale 100, i.e. fact
100M rows, dimension 10M rows.
The two attached PDFs show timings for master + patched branch, with
enable_hashjoin_bloomm=on/off. And then columns showing timing relative
to master (<1.0 speedup / green, >1.0 regression / red). Green = good.
BTW this is from my ryzen machine (Ryzen 9 9900X).
The results for serial queries (workers=0) seem pretty nice. For
selective joins (>50% outer tuples discarded) it's about 20% faster, and
with 5% selectivity (95% discarded), it's ~2x faster. Which seems nice.
The adaptive thresholds seem to about match reality.
For parallel queries it's a bit worse. There are some nice speedups, but
the benefits are clearly more limited. One interesting observation is
that while for serial queries, the cases that most benefit are with
batching, while with parallel joins it's exactly the opposite. See the
hashjoin-bloom-batched.pdf, which shows timings only for queries with
batched joins.
I'm not sure why is that, but it's entirely possible it's due to a bug
in the patch - the parallel join is fairly complex, I can't rule this
out. Or it might be due to some hardware bottlenecks or whatever?
I'd definitely welcome some review and ideas what might be causing this.
One thing I realized when looking at the results is that this may need
some different trade offs regarding the size of the filter. The library
lib/bloomfilter.c aims for 1-2% false positive rate, but we sometimes
end up with a filter like this:
Bloom Filter: Size: 16384kB Hash Functions: 10
False Positive Rate: 0.077%
This is for work_mem=64MB, with batched join:
Buckets: 2097152 Batches: 16 Memory Usage: 82784kB
so maybe it's not that large. But maybe it'd be better to accept
somewhat higher false-positive rate (e.g. ~10%) in exchange for a much
smaller filter, and fewer hash functions (i.e. fewer bits to check)?
regards
--
Tomas Vondra
| Attachment | Content-Type | Size |
|---|---|---|
| v2-0002-Using-Bloom-filters-for-parallel-hash-joins.patch | text/x-patch | 39.4 KB |
| v2-0001-Using-Bloom-filters-for-serial-hash-joins.patch | text/x-patch | 50.7 KB |
| hashjoin-bloom-complete.pdf | application/pdf | 84.9 KB |
| hashjoin-bloom-batched.pdf | application/pdf | 79.7 KB |
| hashjoin-bloom.tgz | application/x-compressed-tar | 65.4 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Andres Freund | 2026-06-02 15:40:40 | Re: Heads Up: cirrus-ci is shutting down June 1st |
| Previous Message | Tom Lane | 2026-06-02 15:20:46 | Re: Broken build on macOS (Universal / Intel): cpuid instruction not available |