Re: Add a greedy join search algorithm to handle large join problems

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Chengpeng Yan <chengpeng_yan(at)outlook(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, lakshmi <lakshmigcdac(at)gmail(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Add a greedy join search algorithm to handle large join problems
Date: 2026-05-12 11:51:50
Message-ID: CANWCAZaetVkaZnR_fxw1DAUrSJ+sQ1PT6UFdwHarHUNNPzWueg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, May 11, 2026 at 9:33 PM Chengpeng Yan <chengpeng_yan(at)outlook(dot)com> wrote:

> > On May 5, 2026, at 13:40, John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
> > Given that all heuristic join enumeration methods can produce
> > spectacularly bad plans, the ability to influence the plan is more
> > crucial with large join problems with small ones. Features should be
> > orthogonal in general, and in this case, integrating well with plan
> > advice seems like a strong deciding factor.
>
> I agree. For large joins, cardinality estimation, the cost model, and
> the join-order search all interact. Even a better search method can
> choose a bad plan if the cardinality estimates or cost model give it
> misleading input. So the ability to tune or influence the chosen plan
> becomes especially important.
>
> After reading more of the pg_plan_advice discussion and code, my current
> understanding is that GOO should be easier than GEQO to integrate with
> join order advice. Please correct me if this understanding is wrong. My
> reading of the pg_plan_advice discussion is that advice can only
> reliably nudge the planner towards plans it would have considered
> anyway; with GEQO, not all join orders are explored, so join-order
> advice can only constrain the options that the genetic search actually
> considers [3][4].

I believe that's the current thinking, but note this from the plan
advice README:

"XXX Need to investigate whether and how well supplying advice works with GEQO"

> GOO seems to have a useful property here: candidate generation is
> explicit. At each contraction step it evaluates legal pair joins, so an
> advice implementation could reject or prefer candidate pairs at that
> point. This does not make impossible advice feasible, and it still would
> need careful integration, but it seems less dependent on whether GEQO
> happened to generate a compatible join sequence.

That's a reasonable hypothesis. I'm not sure what you mean by "careful
integration". Would some aspects of the patch need to change?

> I also noticed a statistics-sensitivity issue while rerunning the
> large-query subset. I ran five rounds over the same 33 queries and the
> same DP / GEQO / GOO(combined) variants. Each round refreshed statistics
> before executing the measured queries. I saved the statistics snapshot
> with pg_dump --statistics-only and kept the per-round result
> spreadsheets; these are in round_artifacts-20260511.zip, organized as
> round1 through round5. The per-run measured repetitions are
> comparatively stable; the large differences below correspond to changed
> plans after refreshed statistics, not just measurement noise.
>
> Execution-time sum:
>
> | round | DP ms | GEQO ms | GOO(combined) ms |
> | --- | ---: | ---: | ---: |
> | round1 | 41091 | 574582 | 87706 |
> | round2 | 41331 | 51205 | 85038 |
> | round3 | 40493 | 95580 | 85561 |
> | round4 | 41464 | 99228 | 83930 |
> | round5 | 41611 | 126544 | 85578 |
>
> GOO(combined) has the lower execution-time sum in four of the five
> rounds. The exact bucket counts are not identical in every round, but
> each round still has both large GOO(combined) wins and large
> regressions. More importantly, GEQO varies much more across statistics
> snapshots: its execution-time sum ranges from 51s to 575s, while
> GOO(combined) stays in the 84s-88s range.

That's an interesting finding! We'll want to keep this behavior in
mind and see how reproducible it is. You mentioned you didn't want to
draw any general conclusions (reasonable), but a 10x variation just
from statistics does put a new perspective on the test results.

> As I mentioned in the previous message, the remaining GOO(combined)
> regressions fall into two groups: sometimes neither cost nor result_size
> finds a good greedy candidate, and sometimes a good greedy candidate
> exists but the final estimated-cost selector chooses a slower plan.
>
> So I am still looking at the two follow-up directions mentioned earlier,
> both building on the current GOO work. One is to improve pure-GOO
> candidate generation and the final selector, for example by adding more
> diverse strategies and making the selector consider signals beyond final
> estimated cost. The other is to use exact DP for a prefix of the problem
> and fall back to GOO when the DP budget is not enough. The first
> direction may need broader changes across planner or estimation-related
> code,

I agree we should try to avoid strategies that depend on changing other areas.

> so I am currently focusing more on the second direction and will
> share the code and numbers once I have a clearer result.

I've mentioned elsewhere that graceful degradation is a good property
(and a prerequisite for robustness), and cliff behavior is one
disadvantage of the GEQO search. However, it's not going to scale:
There will always be large join problems where a heuristic search will
find terrible plans, and that means severe regressions are possible,
as we've seen. Further, this direction opens up another large design
space that I fear will complicate testing and review, and it's already
hard enough.

So, the above doesn't seem like a logical next step from the data thus
far. (To be clear, I think it's good future work, but that presumes
getting to a first commit).

To my mind, we still need more data to inform future direction. The
most important things IMO are
- plan advice -- My current thinking is, the plan advice aspect is the
"wild card" that could immediately improve user experience with large
joins. Conversely, without the ability to fix regressions, any new
approach is risky. How well does plan advice work with GEQO? Does it
work with GOO already or would that need additional coding?
- more large join benchmarks (I see you have additional benchmarks in
your repository above)

--
John Naylor
Amazon Web Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2026-05-12 11:58:48 Re: (SQL/PGQ) cache lookup failed for label
Previous Message Matthias van de Meent 2026-05-12 11:47:23 Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions