| From: | Chengpeng Yan <chengpeng_yan(at)outlook(dot)com> |
|---|---|
| To: | Tomas Vondra <tomas(at)vondra(dot)me> |
| Cc: | John Naylor <johncnaylorls(at)gmail(dot)com>, lakshmi <lakshmigcdac(at)gmail(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com> |
| Subject: | Re: Add a greedy join search algorithm to handle large join problems |
| Date: | 2026-06-23 06:46:35 |
| Message-ID: | 4B651EFB-1EF1-47AA-A9CF-951C46FF487E@outlook.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi Tomas,
Thanks a lot for taking another look, for rebasing the patches, and for
the detailed high-level feedback.
> On Jun 22, 2026, at 23:33, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> Hello Chengpeng,
>
> I finally had some time to look at this patch series again, and I wonder
> what are your plans with it for the PG20 cycle.
My goal for PG20 is still to make progress toward a practical GEQO
replacement for large join problems. I would like to push this work
toward something that can be merged, but I agree the final shape does
not necessarily have to be the current GOO algorithm. If a LinDP-style
approach turns out to be a better fit for postgres, I would be happy to
explore that direction too.
> FWIW attached is a v6 of the most patches, which is merely v5 rebased to
> current master (the was a bit of bitrot). I haven't done any changes. I
> have a couple minor comments, but but let me share some high level ideas
> first about improvements to join search in general.
Thanks for rebasing the series. One thing I am still working on is
`pg_plan_advice` support for GOO. I think this is important because
large join problems need a practical way to be tuned when the heuristic
picks a bad plan, and GEQO has limitations in this area today.
> I happened to be reading some recent papers about join order search,
> because of a separate problem of replacing join_collapse_limit with
> something less blunt (I'll start a separate thread about it). And there
> seems to be a number of very promising approaches to join order search,
> some of which seem like a better fit than the proposed GOO.
>
> Let me explain why I think so.
>
> For starters, consider this "join ordering" CMU lecture [1] from 2025.
> That's Andy Pavlo's lecture, which generally are great overviews of the
> particular field, including the new ides/approaches.
>
> At the beginning, it divided the queries by "size" (number of relations
> joined, for simplicity), and it maps these cases to approaches:
>
> 1) small => solve optimally
>
> 2) medium => search space linearization
>
> 3) large => graceful greediness
>
> Our standard_join_search is clearly for the "small" case. I believe GOO
> is a "graceful greediness" approach - it certainly does not do any
> linearization, at least as I understand it. So it'd be good for "large"
> problems (which I think is 1000+ tables).
>
> The way I understand this, this suggests there are approaches that can
> do much better for "medium" queries with up to ~1000 tables (and that's
> going to be ~99.999% of user queries). So by going to GOO right away,
> we're probably failing to find much better plans.
>
> In particular, I find the papers about the "LinDP" family of algorithms
> extremely intriguing:
>
> - Adaptive Optimization of Very Large Join Queries (Radke, Neumann),
> 2018 [2]
>
> - LinDP++: Generalizing Linearized DP to Crossproducts and Non-Inner
> Joins (Radke, Neumann), 2019 [3]
>
> - Optimizing Linearized Join Enumeration by Adapting to the Query
> Structure (Birler, Stoian, Neumann), 2025 [4]
>
> Highly recommended papers, BTW. It was a fun weekend read. T. Neumann is
> a guarantee of a great read.
>
>
> So maybe we should try implementing one of these algorithms first?
>
> I don't actually think it'd be more code than the current GOO code, and
> it seems to have a reasonably sound theory behind it.
>
> Of course, we may still need some "fallback" mode with heuristics, for
> joins too large even for the LinDP algorithms. So maybe that's where GOO
> would come into play? But do we have an idea how large joins it can
> actually solve in practice? Maybe the LinDP algorithms are universally
> better, even for large joins?
I agree with this framing, and the LinDP family does look very
promising. It may well be a better fit for the "medium" range than going
directly from DP to a pure greedy algorithm.
My reason for starting with GOO was mostly pragmatic. It is simple,
direct, and small enough to keep the integration and review surface
manageable. It also gives us a concrete baseline for the "graceful
greediness" side of the design space, either as a possible fallback for
very large joins or as a reference point for comparing more advanced
algorithms.
I do not yet have a strong answer for how large a join problem GOO can
handle well in practice. That is one of the main things the benchmark
work needs to answer. I also think getting the benchmark methodology
right is important before comparing GOO with a more sophisticated
algorithm. BTW, I also think constructing useful benchmark cases for
this problem is itself difficult, and I am still thinking about that.
I did skim through some of the papers you mentioned before, but it has
been a while. I need to go back and read them more carefully before
forming a stronger opinion.
At first glance, my main question is not the algorithmic complexity
itself, but how naturally LinDP maps onto postgres' current join-search
framework. I expect much of the costing and join-legality machinery to
remain useful. What I need to understand better is the enumeration side:
LinDP is organized around a linearization of the join problem, while
postgres's existing join search is mostly built around forming joinrels
for relation subsets.
Once the current GOO patch is in a more reviewable state, I would be
interested in trying a LinDP-style implementation and comparing it
against GOO/GEQO under the same benchmark setup.
> Now, a couple comments about the v5 (or v6) patch.
>
> First, I generally like the code. I haven't done a deep thorough review,
> but I find it quite readable. I also like how much smaller it is than
> the GEQO code - it's 24kB vs. 172kB. Of course, that only matters if GOO
> can do good enough job (compared to GEQO).
>
> While messing with the patch, I fond a relatively simple query that
> fails to find a plan:
>
>
> CREATE TABLE t1 (a int, b int);
> CREATE TABLE t2 (a int, b int);
> CREATE TABLE t3 (a int, b int);
> CREATE TABLE t4 (a int, b int);
>
> SET enable_goo_join_search=on;
> SET geqo_threshold = 2;
>
> SELECT 1 FROM
> t2 B, t3 C,
> LATERAL (SELECT t1.a, t1.b FROM t1 WHERE t1.b = C.a OFFSET 0) A,
> LATERAL (SELECT t4.a, t4.b FROM t4 WHERE t4.b = B.a OFFSET 0) D
> WHERE A.a = B.a AND C.a = D.a;
>
> ERROR: GOO join search failed: all strategies exhausted without a valid
> join order
>
>
> I'm not really sure what's causing this, but I see GEQO has this comment
> about LATERAL joins:
>
> * In particular, we might clump together relations that actually
> * mustn't be joined yet due to LATERAL restrictions; since there's no
> * provision for un-clumping, this must lead to failure.
>
> Could this be causing issues for GOO? If it merges clumps it shouldn't
> have merged, what happens then? Does it just fail?
>
> IMHO that's not a behavior we could accept. A practical join search
> algorithm needs to be able to recover in some way, especially if it's
> used as a fallback for joins that are "too large" for the DP case.
Yes, I think this is the same class of problem. More generally,
heuristic join search methods that commit clumps irreversibly can run
into this kind of issue with the current infrastructure, although the
exact triggering queries may differ.
GOO already uses the existing per-candidate checks, so immediately
illegal joins should be rejected. The missing part is that those checks
do not prove that committing a locally valid clump still leaves a
globally completable join problem. With `LATERAL`, a joinrel can be
legally formed while its paths remain parameterized by rels outside the
clump. If an irreversible heuristic commits the wrong clump, the
remaining clumps may no longer have a legal completion.
The possible fixes I see are:
1. Add a stronger candidate-safety check before committing a GOO clump.
For `LATERAL`, that means not only asking whether the candidate
itself is legal, but also whether committing it leaves the remaining
clumps with a legal continuation.
2. Add limited backtracking, so that if the greedy path reaches a dead
end, GOO can undo one or more committed clumps and try the next
candidate.
3. Fall back to GEQO when GOO cannot complete a join order. This may be
useful as a safety valve, and GEQO can produce a plan for this
particular case. But I do not think it fully solves the underlying
problem, because the GEQO comment points out a similar limitation of
irreversible heuristic clumping. In the current patch I made this
fail hard instead of silently falling back to GEQO for two reasons:
first, to expose these GOO failure cases during review; and second,
because if we eventually want another algorithm to replace GEQO,
depending on GEQO as the long-term recovery path may not be the right
maintenance model.
My current inclination is toward the first option, because it tries to
address the failure mode inside GOO instead of just hiding it behind a
fallback. But I need to study whether such a check is feasible and how
difficult it would be to implement.
I am not yet sure whether a LinDP-style approach would naturally avoid
this, and I need to study that more before making a stronger claim.
--
Best regards,
Chengpeng Yan
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Xing Guo | 2026-06-23 06:46:44 | [PATCH v1] PL/Perl: Fix NULL deref for forged array |
| Previous Message | Christophe Pettus | 2026-06-23 06:42:37 | Re: The PostgreSQL C Dialect |