| From: | Tomas Vondra <tomas(at)vondra(dot)me> |
|---|---|
| To: | Andrei Lepikhov <lepihov(at)gmail(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, andrew(at)dunslane(dot)net |
| Subject: | Re: hashjoins vs. Bloom filters (yet again) |
| Date: | 2026-06-02 16:01:59 |
| Message-ID: | 722610dd-954a-4fad-a4dc-5f42075845b1@vondra.me |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On 6/1/26 11:30, Andrei Lepikhov wrote:
> Postgres is still gaining ground in this area. It’s helpful to see how other
> databases handle these challenges.
>
> On 30/05/2026 02:55, Tomas Vondra wrote:
>> The patches got stuck mostly because deciding if it makes sense to
>> build/use the Bloom filter is somewhat hard. For cases where 100% of the
>> tuples have a match it's pointless - it's just pure cost, no benefit.
>> The regressions are relatively small, though (<10%).
>
> We ran into the same problem when trying to estimate the number of 'generated'
> NULLs on the nullable side. So, it makes sense to focus on the estimation method
> for 'unmatched' tuples as a separate task.
>
Not sure how treating it as a separate task solves that? In any case,
the patches I posted a couple minutes ago (for filters in the scope of a
single hashjoin, but the problem is the same) deal with this by delaying
the build until execution time, when we have better idea how many outer
tuples match the hash table.
>> However, chances are the filter is too big. We can't get work_mem, the
>> join is already using that for the hash table etc. We can maybe use a
>> fraction of it, and that may not be enough to fit the "perfect" filter.
>> We could bail out and not use any Bloom filter at all, but that seems a
>> bit silly. Maybe we can't fit the 2% filter, but 5% of 10% would be OK?
>
> Looking at DuckDB’s code, using bloom filters during hash table construction
> solves this issue.
> From what I can tell, Apache Impala [1] and Spark [2] use the same approach.
>
It's not clear how any of these solve the issue I described (about
sizing the filter and the trade offs). The links just say that the
feature exist.
>> Of course, it depends on the selectivity of the joins (and thus how many
>> tuples get discarded by the filters). But because it moves all the
>> "cheap" filter probes *before* probing any of the hash tables, it has a
>> multiplication effect for the benefits.
>
> In my experience, the outer side often has a complex subtree and is sometimes
> capped by a GROUP BY statement, or even a HAVING clause, which can break all
> estimations. A bloom filter might help if there is an accidental misestimate.
>
Perhaps, but with substantial misestimates all bets are off. Maybe it'd
be better to discuss a particular example.
>> So I tend to see this as an opportunistic optimization. We do the
>> planning assuming there's no Bloom filter push-down, and then after the
>> fact we see if there's an opportunity after all. Which means we may not
>> pick a plan with hash joins, not realizing it might be made faster.
>
> This approach should not cause any issues. It is likely a reasonable way to
> improve performance without expanding the optimisation scope, which would
> increase planning time. We can always adjust it later if needed.
> For example, I am designing the post-optimising NestLoop 'lazy join' [3] using
> the 'gating' concept.
>
I agree, except that it also makes EXPLAIN pretty difficult to
interpret, because it "breaks" the row counts.
>>
>> The bigger issue for me is that it may make the EXPLAIN ANALYZE output
>> way harder to understand. The estimated "rows" are calculated before the
>> filter push-down happens, while the actual "rows" are with the filter
>> probing, of course. But it seems pretty easy to get confused by this,
>> and think it's just an incorrect estimate.
>
> People are often confused when trying to understand the correctness of
> estimation for parallel plans and, in some cases, MergeJoin plans. Personally, I
> don't think it's a big issue.
>
I disagree. The fact that people may be confused by plans does not mean
we can just make plans confusing for everyone.
> Overall, I think there are even more useful ways to apply bloom filters in the
> planner:
> 1. Real-time partition pruning
> 2. FDW pushed-down filters, which are especially helpful for sharded tables.
> 3. Skipping storage layer blocks. I know of at least one attempt to use the
> BRIN+FSM approach to avoid reading parts of a large table that definitely don't
> match the filter. Bloom filters could be used here as well.
>
Could be. I speculated about options (2) and (3) myself elsewhere in
this thread.
regards
--
Tomas Vondra
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Melanie Plageman | 2026-06-02 16:15:40 | Re: BM_IO_ERROR flag is lost in TerminateBufferIO due to order of operations in UnlockBufHdrExt |
| Previous Message | Andres Freund | 2026-06-02 15:57:34 | Re: Heads Up: cirrus-ci is shutting down June 1st |