Quick Links

WIP: replacing join_collapse_limit with "join hardness" estimate

From:	Tomas Vondra <tomas(at)vondra(dot)me>
To:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>
Subject:	WIP: replacing join_collapse_limit with "join hardness" estimate
Date:	2026-07-01 19:17:55
Message-ID:	0a377176-bc3c-481d-a4d8-693fde757807@vondra.me
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

Our join search approach relies on join_collapse_limit to split it into
smaller join problems that the standard_join_search() can process in an
acceptable amount of time. The risk is that we'll have to explore too
many join orders, given by the formula

N! * C(N-1) = (2*N - 2)! / (N - 1)!

where N is the number of relations, and C(.) is the Catalan number.
Which grows pretty quickly:

N N! * C(N-1)
2 2
3 12
4 120
5 1,680
6 30,240
7 665,280
8 17,297,280
9 518,918,400
10 17,643,225,600

Of course, this is an upper boundary. The formula completely ignores the
structure of the join - which relations are connected by clauses etc.

There's no way we could explore hundreds of millions of join orders. For
one, it'd take quite a bit of time, but also because of memory - we tend
not to free memory during planning (and a lot of it we even can't free).

And for the purpose of preventing such disasters, join_collapse_limit=8
works pretty well.

But there's a downside - it can force a pretty bad plan. If a selective
join gets added at the end of a query, past the join_collapse_limit, we
won't be able to do this join early. This is harmful, and it's a source
of "performance cliffs", where a small change in the SQL results in
massive changes in massive drop in performance.

This is a widely known issue, and there's not much users can do about
it. They can either keep join_collapse_limit set to a conservative value
(and keep having this issue), or increase the limit (and face the risk
of expensive planning / OOM).

Yes, they could rewrite the queries, but that's a terrible solution. It
assumes the users know the selectivity of the joins (which they don't)
and that they have direct control over the SQL (while queries are often
generated by some sort of ORM). Also, one of the SQL benefits is that
it's declarative, and the optimizer does these decisions on behalf of
the user (and we shouldn't be shifting this to the users).

However, the formula above is just an upper bound (and a very weak one).

You can get that number for "cliques", i.e. joins where every relation
is connected to every other relation. But such queries are exceptionally
rare. Most joins are "sparse", with each relation connected to very few
other relations. That substantially reduces the number of reasonable
join orderings, explored by the exhaustive search. There are also
restrictions implied by non-inner joins. All of that restricts the
number of joins the exhaustive search would have to evaluate.

But the join_collapse_limit only considers the number of relations,
ignores the join structure, and in order to prevent disasters it
prevents optimization even for easy joins.

It is a very blunt instrument - it imposes the same limit on all joins,
no matter how "hard" the particular join problem really is.

join hardness / DPccp
---------------------

What if we had a (relatively) cheap way to calculate a much better
estimate of how hard the join is, using the join graph?

Let's call it a "join hardness", and define it as the number of join
relations explored by the join search, i.e. pretty much the number of
make_join_rel() calls. This is a good way to measure how expensive the
join search is, because building the join rels is the expensive part and
the plan time really is proportional to this number (and I have the data
to show that).

So, how do we calculate that quickly?

One way would be to do a "dry run" of standard_join_search(), skipping
as much of the expensive stuff as possible (path construction etc.) but
that's still damn expensive. And also invasive, because we do build some
of the paths in weird places.

Luckily, in 2006, G. Moerkotte and T. Neumann published a very
interesting paper:

Analysis of Two Existing and One New Dynamic Programming Algorithm
for the Generation of Optimal Bushy Join Trees without Cross
Products
https://www.vldb.org/conf/2006/p930-moerkotte.pdf

In that paper they describe the generation of join pairs as a graph
problem (relations are vertices, edges mean a join clause, ...). The
paper also describes an algorithm called DPccp enumerating all possible
joins, represented as pairs or graphs, connected subgraph + connected
complement (csg+ccp).

This algorithm is unbelievably efficient. Probably orders of magnitude
faster than my "dry run" experiment with standard_join_search(). And it
gives a much tighter upper boundary on the join hardness - it's still an
upper boundary, and maybe it's ~100x off. But it's still waaay better
than the N!*C(N-1) one.

If we want to use this to replace join_collapse_limit, we can make it
even cheaper. We don't need to know the actual number. We just need to
know if it's higher than our "hardness limit". Reasonable values would
be in the ballpark 10k-100k, I think. So we'd run the DPccp algorithm,
and either it completes and returns a lower value (which means it didn't
need to do much work), or it aborts once it exceeds the threshold. And
we know the join is "too hard".

FWIW the idea of using DPccp this way is not my idea. It's proposed in a
different paper by T. Neumann and B. Radke from 2018:

Adaptive Optimization of Very Large Join Queries
https://db.in.tum.de/~radke/papers/hugejoins.pdf

See the page 4, where they say (the [21] is the 2006 paper):

However, what we can do reasonably cheaply is counting the
number of connected subgraphs of a query graph. The number of
connected subgraphs is identical to the size of the full dynamic
programming table [21], i.e., the memory consumption of the DP
algorithm, and indirectly determines its optimization time. If
the number of connected subgraphs is reasonably small, for example
up to 10,000, we know that a graph-based DP algorithm will be fast.
This is the case for the examples given above, i.e., chain queries
with up to 100 relations and clique queries with less than 14
relations.

So maybe it's not a totally wild idea.

PoC patch series
----------------

To show this is totally doable, I've implemented the attached WIP patch
series implementing this as a join_search_hook.

The primary goal is to allow calculating the hardness estimate, and
showing how it aligns with the "actual" hardness, i.e. the number of
make_join_rel calls. There is a (more) experimental part trying to then
split the join problems using the hardness, but that's very hacky (and
I'll get back to that in a bit).

The patch series has several commits, with most of the DPccp code in the
0001 patch, and then some adjustments to handle clause-less joins etc.

v1-0001-Estimate-join-search-hardness-using-DPccp.patch
- basic DPccp algorithm, prints the estimate

v1-0002-Count-joinrels-constructed-by-standard_join_searc.patch
- allows comparing the estimate vs. actual hardness

v1-0003-Split-joins-estimated-to-be-too-hard.patch
- experimental/PoC: splits the join based on hardness

v1-0004-Fast-path-out-for-small-joins.patch
v1-0005-Correctly-handle-clause-less-joins.patch
- improvements to 0001

v1-0006-join-search-benchmark.patch
- benchmarking script, generating all kinds of random joins of different
sizes, collecting the estimate + actual hardness

There's a couple GUCs to make testing and experiments easier:

join_search.enable = off - calculate (and log) the DPccp estimate
join_search.threshold = 10k - when is a join considered "hard"?
join_search.split = off - split "hard" join problems?
join_search.fast = off - quick exit for provably easy joins

If you want to test it, it's enough to load the module and set the
"enable" flag to "on". If you apply 0002, you'll also get elog(WARNING)
with the actual hardness, calculated during standard_join_search().

The (ugly) benchmark script generates joins with randomized parameters
(number of join clauses, join types, etc.). The script extracts all
kinds of information from the server log (not pretty, but it works).

I'm open to ideas regarding what other joins it might generate. There
may be interesting joins the script does not generate. I'd love to know
about those, so that I can test them.

Note: I've been running the tests with a number of patches reducing the
memory usage in planner, as discussed in

https://www.postgresql.org/message-id/8b2c8e08-b02f-4418-952f-be8eab31e887%40vondra.me

Without these patches a lot of the larger joins got killed because of
OOM (even on a machine with 64GB of RAM).

I have about a zillion interesting charts visualizing the results, but
I'm going to discuss only a couple most interesting ones. Those are in
the attached PDF, in matching sections.

1) accuracy of the hardness estimate

The first question is - is it really an upper bound? And how close it
generally gets to the actual hardness.

See the two charts with estimated (x) vs. actual hardness (y). The chart
on the left is linear, log-scale on the right.

It's clear the estimate is sound - it really is an upper bound. There's
not a single case where the estimate would be lower than the "actual"
hardness per standard_join_search. This is great, it makes it safe for
the purpose of replacing join_collapse_limit.

The estimate is also pretty accurate. For most of the generated joins
we're within 1-2 orders of magnitude. That may sound weak, but recall
the current estimate "effectively" imposed by join_collapse_limit is
likely many orders of magnitude worse (especially for larger joins).

(The script was generating joins of up to 16 tables, and the poor
estimate was ~17e9 with just 10 tables. All the numbers are way lower
than that.)

2) hardness by number of joined tables

Next there are two charts showing the hardness by the number of tables
joined. First fort the actual hardness (per standard_join_search), and
then also for the DPccp estimate.

The charts look very similar. Of course, there are some small
differences because the estimate is an upper boundary, which is visible
especially for 14 tables.

An interesting question is "How large joins would we handle with a given
budget?" And these two charts can answer that (particularly the second
one, for estimated hardness).

If we set the budget to 10k, that'd cover all joins of 9 relations (so a
bit more than the current join_collapse_limit default. But it'd also
handle many of the larger joins - all the data points below the 10k
horizontal line.

With a 100k budget, we'd handle all joins up to 11 relations, plus many
of the larger joins.

3) relative error (= estimate / actual)

The error grows with the number of relations (and with the estimate). I
don't think this is very surprising, but it's worth pointing out.

4) accuracy by join types

Another interesting aspect to look at is join types. The script
generates INNER, LEFT and RIGHT joins, so how do these affect the
estimate accuracy?

The four charts plot the relative error (= estimate / actual) vs.
fraction of the joins of a given type. So 0.0 means there are no charts
of the given type, 1.0 means all joins are of that type. This is better
than plotting the raw number of joins, because that's misleading for
joins of different size.

One interesting thing is that LEFT joins behave rather differently from
INNER and RIGHT joins. If there are only LEFT joins (fraction 1.0), the
range of relative errors is much larger (by two orders of magnitude). If
there are any other joins (INNER or RIGHT), the relative error drops
below ~100 (and usually below 10).

I'm a bit puzzled by this. I thought LEFT joins and RIGHT joins are
quite symmetrical, i.e. that we "transform" RIGHT joins to LEFT joins.
So I find this difference in behavior a bit strange ...

Note: The script does not generate FULL joins, because those forcefully
split the join into subproblems. I now realized that may not be quite
true when handling the FromExpr, so I plan to adjust that, but the
current results don't have any FULL joins.

5) cost of estimate

I started this by suggesting DPccp is extremely efficient, much much
faster than the regular standard_join_search. But by how much?

Here's two charts plotting the estimation time and the full planning
time (with x axis being the estimated hardness). The second chart shows
estimation time with a 100k budget (which is why it "levels off").

Note that the charts are logscale, and that each grid step is two orders
of magnitude. The plan time spread increases with a growing hardness
estimate (which is reasonable, it just means some of the estimates are
higher than actual hardness).

With the 100k budget we're at ~1ms (this is a ryzen5 machine, BTW). And
for the cheaper joins we're at ~1% of the plan time. For the larger
joins we're getting closer, but I'd bet it's negligible compared to the
actual execution time.

Note: There's a minor issue - the estimate is included in the plan time
(as reported by log_planner_stats). For most cases it's negligible, but
for some of the joins with estimate much higher than real hardness is
may be getting somewhat close. I still think it'll be negligible with
respect to the overall execution time.

Where to apply join_collapse_limit?
-----------------------------------

One important detail I didn't mention yet is that join_collapse_limit is
applied very early, in deconstruct_jointree(). I claim that may be too
early, and that we should consider delaying it until much later.
Possibly to make_one_rel(), or right before it.

To build the join graph we need information that is not available in
deconstruct_jointree(). We need to call have_relevant_joinclause,
have_join_order_restriction and possibly other similar functions.

The join_hardness module installs join_search_hook, which is called in
make_one_rel() when it builds join rels for the join subproblems. And to
get the whole query as a single list (modulo FULL joins), you have to
set the join_collapse_limit high enough so that deconstruct_jointree
does not split it early.

I suppose it might be possible to build the graph and calculate the
estimate in deconstruct_jointree. It'd however be redundant (and likely
less efficient) than when using have_relevant_joinclause etc.

But I think there are other reasons why I think it might be better to
postpone the "splitting" to make_one_rel/make_rel_from_joinlist.

I've been looking at various optimizations of the join search, like the
starjoin optimization. The also require looking at the join graph, and
various other details (e.g. foreign keys, or even selectivity of the
joins). Which we only deduce sometime after deconstruct_jointree().

The whole idea of the starjoin optimization is to recognize a pattern in
the join graph, and reducing the join hardness by defining a canonical
join order for the starjoin cluster. It'd be a bit silly to calculate
the hardness, and only then do the starjoin stuff (or whatever else for
other join patterns). That way we'd be able to handle larger joins.

And finally, I'm working on replacing the join search with a LinDP++
approach (per papers authored again by Neumann, Radke, Stoian and
Birler). I'm not ready to share that patch series yet, but that also
requires building join (hyper)grahp, and various other information (e.f.
for costing the joins) not available that early.

regards

--
Tomas Vondra

Attachment	Content-Type	Size
v1-0001-Estimate-join-search-hardness-using-DPccp.patch	text/x-patch	21.0 KB
v1-0002-Count-joinrels-constructed-by-standard_join_searc.patch	text/x-patch	4.5 KB
v1-0003-Split-joins-estimated-to-be-too-hard.patch	text/x-patch	11.5 KB
v1-0004-Fast-path-out-for-small-joins.patch	text/x-patch	4.5 KB
v1-0005-Correctly-handle-clause-less-joins.patch	text/x-patch	6.0 KB
v1-0006-join-search-benchmark.patch	text/x-patch	9.7 KB
hardness-eval.pdf	application/pdf	355.3 KB

Responses

Re: WIP: replacing join_collapse_limit with "join hardness" estimate at 2026-07-02 14:34:41 from John Naylor

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Erik Wienhold	2026-07-01 19:36:23	psql: Fix \df tab completion for procedures
Previous Message	Robert Haas	2026-07-01 19:11:17	satisfies_hash_partition crash