Re: hashjoins vs. Bloom filters (yet again)

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hashjoins vs. Bloom filters (yet again)
Date: 2026-06-02 15:46:56
Message-ID: 2f9a99ef-4857-4d4c-8544-30078a38dfa1@vondra.me
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 5/31/26 17:03, Andrew Dunstan wrote:
>
> ..>
> Here are 3 patches (developed using Claude) that sit on top of your POC.
>
> Patch 1 enables the pushdown filters for custom scans. As you say it's
> fairly mechanical and is enabled by a CUSTOMPATH_SUPPORT_BLOOM_FILTERS
> path flag.
>
> Patch 2 provides for building per-key filters in addition to the multi-
> key filter if that flag is set. There may be other cases that would want
> it, but this would suit my immediate use case.
>
> Patch 3 provides for eager creation of the filter(s) in such cases,
> disabling the optimization you mentioned in point 1 above.
>

Thanks. I'll take a look when I have time.

>
>>
>> FWIW I think the main difficulty for this PoC is going to be the
>> planning/costing stuff, and the impact on EXPLAIN.
>>
>>
>
>
> I haven't dealt with that or other issues you raise, but I think this is
> enough for me to begin testing. I have adapted my TAM to it and verified
> that it acts as expected. I will start doing some benchmarks.
>

OK. I think it's enough for testing, i.e. to see if it's actually worth
pursuing further. But I think we'll eventually need to solve the issues
planning/costing somehow. I'm not sure it'll be committable without
having some sort of solution.

I happened to find this 2025 paper:

Including Bloom Filters in Bottom-up Optimization
https://arxiv.org/html/2505.02994v1

I read it over the weekend, and interestingly enough it's exactly about
the planning issues I outlined last week, i.e. difficulties with costing
paths that might include pushed-down Bloom filters.

They even describe a solution that kinda looks a bit like "tracking a
separate set of paths" from my e-mail, although they use somewhat
different terminology (sub-plan == our path, etc.). But if you squint a
little bit, it talks about the costing issue, path explosion, etc.

Their solutions is some sort of two-phase process, which I'm not sure we
can do. It'd require a fundamental rework of how we construct join rels
and all that, and TBH I don't have an ambition to do that.

But while reading the paper, I kept thinking about how we deal with
pathkeys. I wonder if we could do something similar to that? That is,
have a concept of "potentially interesting" filters, and construct the
extra paths only for those, to limit the number of extra paths.

Imagine we construct the the baserels (essentially scan nodes), and then
do a pass over those. Each scan would look at what joins it participates
in, and which of those could benefit from a Bloom filter (some can't,
because it's a FK join, or we don't expect many rejected tuples, or
maybe it's a LEFT JOIN, ... etc.).

And then we'd maybe have some additional heuristics to pick which "Bloom
filters" to attach to the path. And then later, when planning that
particular join involving that path, we'd reject the join if it's not a
hash join. The scan would always have to construct a "clean path" not
requiring any filters, similarly to what custom scans need to do for
parallel paths.

It's just a rough idea, but I think it would work. Worth a try.

regards

--
Tomas Vondra

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2026-06-02 15:57:34 Re: Heads Up: cirrus-ci is shutting down June 1st
Previous Message Andres Freund 2026-06-02 15:40:40 Re: Heads Up: cirrus-ci is shutting down June 1st