Re: hashjoins vs. Bloom filters (yet again)

From: Andrei Lepikhov <lepihov(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hashjoins vs. Bloom filters (yet again)
Date: 2026-06-02 06:25:46
Message-ID: 494586a2-fd9b-44ad-9bb5-4b6cc18bdf53@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 30/05/2026 02:55, Tomas Vondra wrote:
> Earlier, I mentioned the push-down happens in create_hashjoin_plan().
> Which means it happens *after* planning and costing. There are reasons
> for that, but it has some unfortunate & annoying consequences.
>
> Ideally, we'd know about the filters when constructing the scan nodes,
> so we'd have a chance to estimate how many tuples will be eliminated by
> probing the filters (which is about the same thing as estimating the
> join sizes). But we can't do that, because our planner works bottom-up.
> When constructing the scan nodes we know which tables we'll join with,
> but we have no idea which of the join algorithms we'll pick.
>
> We'll consider all three join types, and the scan node has no say which
> of those will win. But the Bloom filter push-down is specific to hash
> joins. So what should the scan node do? Either it can assume it's under
> hash join (and set rows/cost as if there's a Bloom filter), or it can
> set costs in a join-agnostic way (like now).
>
> The only "correct" way I can think of dealing with this in the bottom-up
> world is having two sets of paths - one set for a hash join, one set for
> other joins. But that's not just for scans. We'd need that for all
> paths, and for different combinations of joins. For the query with 3
> joins, we'd end up with 2^3 combinations. That seems not great.
I overlooked this part of your first message, so let me add a quick comment.

In principle, the optimiser is not restricted to bottom-up planning. For
example, in extension modules, I sometimes use the create_upper_paths_hook to
add a 'Top-Down' iteration after 'bottom-up' planning [1].

This helps improve complex query plans, such as adding a Memoize node at the
head of a subplan when the number of distinct input parameter values is expected
to be low. It can also use the startup_cost-optimal subpaths in MergeJoin if
histogram comparisons indicate that only a small portion of the input will be
scanned. There are other possible cases involving LIMIT and sort propagation as
well.

I'm not sure whether this approach makes sense for the specific technique you
develop, since it's already quite complex. Also, additional planning iteration
is a pure overhead in most of cases except complex analytical queries. However,
it might provide an idea for future improvement.

[1] https://github.com/danolivo/conf/blob/main/2025-MiddleOut/MiddleOut.pdf

--
regards, Andrei Lepikhov,
pgEdge

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2026-06-02 06:28:07 Re: A few message wording/formatting cleanup patches
Previous Message Chao Li 2026-06-02 06:24:01 Re: Fix bug of CHECK constraint enforceability recursion