| From: | Tomas Vondra <tomas(at)vondra(dot)me> |
|---|---|
| To: | Alexandra Wang <alexandra(dot)wang(dot)oss(at)gmail(dot)com>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com> |
| Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)lists(dot)postgresql(dot)org, hs(at)cybertec(dot)at, Jeff Davis <pgsql(at)j-davis(dot)com> |
| Subject: | Re: Is there value in having optimizer stats for joins/foreignkeys? |
| Date: | 2026-02-01 16:19:04 |
| Message-ID: | de2de0a1-3af9-45dd-86f1-5368b73ca89c@vondra.me |
| Lists: | pgsql-hackers |
Hi Alexandra!
On 1/29/26 06:04, Alexandra Wang wrote:
> Hi hackers,
>
> As promised in my previous email, I'm sharing a proof-of-concept patch
> exploring join statistics for correlated columns across relations.
> This is a POC at this point, but I hope the performance numbers below
> give a better idea of both the potential usefulness of join statistics
> and the complexity of implementing them.
>
Great! Thanks for doing this, I think it's a very worthy improvement,
and I'm grateful you're interested in working on this.
> Join Order Benchmark (JOB)
> ---------------------------------------
>
> I got interested in this work after reading a 2015 VLDB paper [1] that
> introduced the Join Order Benchmark (JOB) [2]. JOB uses the IMDB
> dataset, with queries having 3-16 joins (average 8). Unlike standard
> benchmarks such as TPC-H, which use randomly generated uniform and
> independent data, JOB uses real-world data with correlations and
> skewed distributions, which make cardinality estimation much harder.
>
> The paper tested PostgreSQL among other databases. I reran the
> benchmark on current PG 19devel to see where we stand today. (The
> original data source is gone, but CedarDB hosts a mirror [3]; I
> created a fork [4] and added some scripts. The schema DDLs and queries
> are unchanged.)
>
> Test setup:
> MacBook, Apple M3 Pro
> Postgres master branch (19devel), built with -O3
> cold and warm runs, 3 runs each
>
> Results (average of 3 runs):
>
> Postgres 19devel branch:
> Total execution time cold run: 140.75 s (±2.12 s)
> Total execution time warm run: 117.72 s (±0.76 s)
>
> Postgres 19devel branch, SET enable_nestloop = off:
> Total execution time cold run: 87.16 s (±0.17 s)
> Total execution time warm run: 86.68 s (±0.42 s)
>
> These results suggest that the planner is frequently choosing nested
> loop joins where hash or merge joins would be better.
>
> (As a side note: with nestloops disabled, plans are dominated by hash
> joins, so cold vs warm runs are very similar due to limited cache
> reuse.)
>
Right. I have not investigated plans for the JOB queries, but a common
issue is that we pick a nested loop very early in the plan due to a poor
estimate, and that leads to a sequence of nested loops, which just
amplifies the problem.
Is it possible to measure the accuracy of the estimates, not just the
execution time? I think it would be useful to know how much "closer" we
get to the actual cardinality. There may well be queries where the
execution time did not change much, but in a slightly different setting
it could easily be an issue.
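Something like this might work to get a first approximation (a rough
sketch; the job_plans table and the loading step are placeholders, and a
real version should walk all plan nodes, not just the root, and handle
zero-row nodes):

  -- sketch: stash EXPLAIN (ANALYZE, FORMAT JSON) output per query, then
  -- compute the q-error max(est/act, act/est) of the root plan node
  CREATE TABLE job_plans (query text, plan json);

  -- load one row per benchmark query, e.g. from a driver script

  SELECT query,
         (plan -> 0 -> 'Plan' ->> 'Plan Rows')::float8   AS est_rows,
         (plan -> 0 -> 'Plan' ->> 'Actual Rows')::float8 AS act_rows,
         greatest((plan -> 0 -> 'Plan' ->> 'Plan Rows')::float8
                  / (plan -> 0 -> 'Plan' ->> 'Actual Rows')::float8,
                  (plan -> 0 -> 'Plan' ->> 'Actual Rows')::float8
                  / (plan -> 0 -> 'Plan' ->> 'Plan Rows')::float8) AS q_error
  FROM job_plans
  ORDER BY q_error DESC;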
> Problem
> --------
>
> The slowest query in this workload is 16b.sql (attached). Its plan
> contains 7 nested loop joins.
>
> The innermost join is estimated as 34 rows but actually produces
> 41,840 rows — a ~1230x underestimation:
>
> Nested Loop (cost=6.80..3813.14 rows=34 width=4) (actual
> time=2.336..304.160 rows=41840 loops=1)
> -> Seq Scan on keyword k (cost=0.00..2685.11 rows=1 width=4) (actual
> time=0.402..6.161 rows=1.00 loops=1)
> Filter: (keyword = 'character-name-in-title')
> -> Bitmap Heap Scan on movie_keyword mk (cost=6.80..1124.98 rows=305
> width=8) (actual time=1.933..295.493 rows=41840.00 loops=1)
> Recheck Cond: (k.id = keyword_id)
>
> This underestimation cascades to the outermost join: ~3K estimated vs
> ~3.7M actual rows.
>
> The problem can be reduced to the following query:
>
> SELECT * FROM movie_keyword mk, keyword k
> WHERE k.keyword = 'character-name-in-title'
> AND k.id = mk.keyword_id;
>
> The tables:
>
> keyword(id PK, keyword, phonetic_code)
> movie_keyword(id PK, movie_id, keyword_id)
>
> These tables are strongly correlated via keyword_id -> keyword.id. We
> can even add:
>
> ALTER TABLE movie_keyword
> ADD CONSTRAINT movie_keyword_keyword_id_fkey
> FOREIGN KEY (keyword_id) REFERENCES keyword(id);
> CREATE STATISTICS keyword_stats (dependencies)
> ON id, keyword FROM keyword;
> ANALYZE keyword, movie_keyword;
>
> However, even with the FK constraint and extended statistics on the
> keyword table, the estimate remains unchanged:
>
> SELECT * FROM check_estimated_rows(
> 'SELECT * FROM movie_keyword mk, keyword k
> WHERE k.keyword = ''character-name-in-title''
> AND k.id = mk.keyword_id');
> estimated | actual
> -----------+--------
> 34 | 41840
>
> This indicates that the planner cannot connect the skewed distribution
> of the join key on the referencing table with the filter selectivity
> on the referenced table as it applies to the join result. This is the
> gap that join statistics aim to fill.
>
Agreed, I've seen various cases like this too.
> Proposed Solution
> ------------------
>
> I've attached two patches:
>
> 0001: Extend CREATE STATISTICS for join MCV statistics
+1 to do this (I'm not sure about restricting it to MCV, but that's a
detail we can discuss later)
> 0002: Automatically create join MCV statistics from FK constraints
>
I think this is unlikely to be a good idea, for reasons I already explained earlier
in this thread. Essentially, building and processing the stats is a
cost, and most FKs probably don't actually need it. We don't build
indexes on FKs for similar reasons. And it's not clear to me how we
would know which columns to define the statistics on.
Of course, it's up to you to keep working on this, and try to convince
us it's a good idea. I personally think it's unlikely to get accepted,
but perhaps I'm wrong.
> 0001 is the core patch. 0002 is an optional convenience feature
> that detects FK relationships during ANALYZE and auto-creates
> join stats.
>
> Syntax:
>
> CREATE STATISTICS <name> (mcv)
> ON <other_table>.<filter_col> [, ...]
> FROM <primary_table> alias
> JOIN <other_table> alias ON (<join_condition>);
>
> Example -- single filter column:
>
> CREATE STATISTICS mk_keyword_stats (mcv)
> ON k.keyword
> FROM movie_keyword mk
> JOIN keyword k ON (mk.keyword_id = k.id);
> ANALYZE movie_keyword;
>
> Example -- multiple filter columns:
>
> CREATE STATISTICS mk_multi_stats (mcv)
> ON k.keyword, k.phonetic_code
> FROM movie_keyword mk
> JOIN keyword k ON (mk.keyword_id = k.id);
> ANALYZE movie_keyword;
>
> Catalog changes:
>
> pg_statistic_ext:
> - New stats kind 'c' (join MCV) in stxkind
> - New field stxotherrel (Oid): the other table in the join
> - New field stxjoinkeys (int2vector): join column pair [primary_joinkey,
> other_joinkey]
> - Existing stxkeys stores the filter column(s) on stxotherrel
> - New index pg_statistic_ext_otherrel_index on (stxrelid, stxotherrel)
>
- Why do we need a new stats kind? I'd have expected we'd use the same
'm' value as for single-relation stats, simply because a join result
is just a relation too.
- For a PoC it's OK to have stxotherrel, but the final patch should not
be restricted to two relations. I think there needs to be an array of
the joined rel OIDs, or something like that.
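To illustrate the shape I have in mind (column names entirely made up,
not a worked-out proposal):

  -- hypothetical generalization to N joined relations:
  --   stxrelids   oidvector  -- OIDs of all joined relations
  --   stxkeys     ...        -- per-relation column lists, matching order
  --   stxkind     char[]     -- reuse the existing kinds ('m' for MCV)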
FWIW I recognize figuring out the catalog changes is not easy - in fact
I always found it so intimidating that I ended up procrastinating and not
making any progress on join statistics. So I really appreciate you
taking a stab at this.
> pg_statistic_ext_data:
> - New field stxdjoinmcv (pg_join_mcv_list): serialized join MCV data
>
I don't understand why we need this. AFAICS we should just treat the
join result as a relation, and use the current pg_statistic_ext_data fields.
> How stats are collected:
>
> Join statistics are collected during ANALYZE of the primary table
> (stxrelid). The current approach assumes a dependency relationship
> between the join key column and the filter column(s) on the other
> table. Specifically, during ANALYZE, we reuse the already-computed
> single-column MCV stats for the primary table's join key column. For
> each MCV value, we look up the matching row in the other table via the
> join key and extract the filter column values, carrying over the
> primary-side MCV frequency. This is not accurate in all cases, but
> works reasonably well for the POC, especially for foreign-key-like
> joins where the primary table's join key distribution is
> representative of the join result.
>
For the PoC it's good enough, but I don't think we can rely on the
per-column MCV lists like this.
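To illustrate one failure mode (a contrived example): if the referencing
side is close to uniform, ANALYZE may not build an MCV list for the join
key at all, and then there's nothing to carry over even though the filter
column on the other side is heavily skewed:

  -- fact.dim_id is nearly uniform, so ANALYZE likely collects no MCV
  -- list for it, yet a filter like dim.flag = 'rare' is exactly what
  -- the join statistics would need to capture
  CREATE TABLE dim  (id int PRIMARY KEY, flag text);
  CREATE TABLE fact (dim_id int REFERENCES dim (id));

  INSERT INTO dim
  SELECT i, CASE WHEN i % 1000 = 0 THEN 'rare' ELSE 'common' END
  FROM generate_series(1, 100000) i;

  INSERT INTO fact
  SELECT (random() * 99999)::int + 1
  FROM generate_series(1, 1000000);

  ANALYZE dim, fact;

  -- most likely empty, so the proposed collection step has no values
  -- to look up in dim
  SELECT most_common_vals FROM pg_stats
  WHERE tablename = 'fact' AND attname = 'dim_id';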
> For example, the catalog contents look like:
>
> stxrelid | stxotherrel | stxjoinkeys | stxkeys | stxkind
> --------------+-------------+-------------+---------+--------
> movie_keyword | keyword | 2 1 | 2 | {c}
>
> index | values | nulls | frequency
> ------+--------------+-------+----------
> 0 | {keyword_1} | {f} | 0.06
> 1 | {keyword_2} | {f} | 0.06
> 2 | {keyword_3} | {f} | 0.06
> ... | ... | ... | ...
>
> How the planner uses join stats:
>
> I added the join pattern detection logic in
> clauselist_selectivity_ext() and get_foreign_key_join_selectivity().
> The planner looks for a pattern of:
>
> - An equijoin clause between two relations (the join key pair), and
> - Equality or IN filter clauses on columns of one relation (the filter
> columns)
>
> When this pattern is found and matching join MCV stats exist, the
> planner compares the filter values against the stored MCV items to
> compute the join selectivity, replacing the default estimate.
>
> JOB Benchmark Results
> ---------------------
>
> I created a single join statistics object for the (movie_keyword,
> keyword) pair and reran the JOB benchmark:
>
> CREATE STATISTICS movie_keyword_keyword_stats (mcv)
> ON k.keyword
> FROM movie_keyword mk
> JOIN keyword k ON (mk.keyword_id = k.id);
>
> Total execution times (average of 3 runs):
>
> Cold run Warm run
> Default (baseline) 140.75s (+/-2.1) 117.72s (+/-0.8)
> enable_nestloop=off 87.16s (+/-0.2) 86.68s (+/-0.4)
> With join stats 85.81s (+/-3.4) 65.24s (+/-0.5)
>
> With a single join statistics object:
> - Cold run: -39% vs baseline (140.75s -> 85.81s),
> comparable to enable_nestloop=off
> - Warm run: -45% vs baseline (117.72s -> 65.24s),
> -25% vs enable_nestloop=off (86.68s -> 65.24s)
>
> Per-query times for the 10 slowest (out of 113) queries on baseline
> (master branch, enable_nestloop = on):
>
> Query | Baseline | No Nestloop | Join Stats | Best
> ------+-----------+-------------+------------+----------
> 16b | 10,673 ms | 2,789 ms | 2,775 ms | 3.8x JS
> 17e | 7,709 ms | 2,260 ms | 2,149 ms | 3.6x JS
> 17f | 7,506 ms | 2,304 ms | 2,460 ms | 3.3x NL
> 17a | 7,288 ms | 1,453 ms | 2,388 ms | 5.0x NL
> 17b | 6,490 ms | 1,580 ms | 5,413 ms | 4.1x NL
> 17c | 6,240 ms | 1,095 ms | 5,268 ms | 5.7x NL
> 17d | 6,261 ms | 1,263 ms | 5,291 ms | 5.0x NL
> 6d | 6,234 ms | 988 ms | 1,565 ms | 6.3x NL
> 25c | 6,133 ms | 1,848 ms | 1,738 ms | 3.5x JS
> 6f | 5,728 ms | 1,785 ms | 1,556 ms | 3.7x JS
>
> (JS = join stats fastest, NL = no-nestloop fastest)
>
> All 10 queries improved over baseline. For 16b, 3 nested loop joins
> were replaced with 2 hash joins and 1 merge join. Some queries (e.g.
> 17b) kept the same plan shape and joins but gained a Memoize node from
> better cardinality estimates.
>
Encouraging results, considering the limitations of the PoC patch. As
mentioned earlier, I'm curious how much the estimates improved, not just
the execution time.
> Current Limitations
> -------------------
>
> This is a proof of concept. Known limitations include:
>
> 1. The current catalog design is not ideal. It is asymmetric (a
> "primary" and an "other" table), which is natural for FK-like joins,
> but less intuitive for other joins.
>
Yeah, that's not ideal. I think we'd like a catalog struct that works
for joins of more than 2 tables too.
> 2. Stats collection piggybacks on ANALYZE of the primary table and
> uses its single-column MCV for the join key. This can be inaccurate
> when the MCV values on the "primary" side don't cover the important
> values on the other side, or when the filter column isn't fully
> dependent on the join key. A more accurate approach would execute the
> actual join during collection, which could also decouple join stats
> collection from single-table ANALYZE.
>
Agreed. I don't think we can piggyback on the single-column MCV lists
for the reasons you mention.
> 3. Currently limited to: equality join clauses, equality and IN filter
> clauses, simple Var stats objects (no expressions), inner joins only,
> and two-way joins only. Some of these are easier to extend; others may
> be harder or unnecessary (like n-way joins).
>
> 4. Patch 0002 (auto-creation from FK constraints) should probably be
> gated behind a GUC. I'm not strongly attached to this patch, but kept
> it because FK joins seem like a natural and common use case.
>
> Conclusion
> ----------
>
> Even with all the current limitations, I think this experiment
> suggests that join-level statistics can significantly improve plans,
> as a single join stats object can materially improve workload
> performance.
>
> If there's interest, I'm happy to continue iterating on the design.
>
> In particular, I'd welcome feedback on:
> - whether this is a direction worth pursuing,
Yes, please!
> - the catalog design,
I think we want a catalog that works for joins of more than 2 rels, etc.
> - the stats collection approach,
The best approach I'm aware of is the "Index-Based Join Sampling" paper
I mentioned earlier in this thread. I think it's OK to allow building
statistics only for joins compatible with that sampling approach.
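Roughly, the idea is this (a sketch in plain SQL, using the built-in
TABLESAMPLE instead of the paper's index-driven probes): sample the
referencing side, join the sample to the other side, and build the MCV
list from the sampled join result:

  WITH sample AS (
      SELECT keyword_id
      FROM movie_keyword TABLESAMPLE SYSTEM (1)
  )
  SELECT k.keyword,
         count(*)::float8 / (SELECT count(*) FROM sample) AS est_freq
  FROM sample s
  JOIN keyword k ON k.id = s.keyword_id
  GROUP BY k.keyword
  ORDER BY est_freq DESC
  LIMIT 100;  -- keep only the top entries, like a regular MCV list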
It's not clear to me what should trigger building the stats. I assume
it'd be done by ANALYZE, but that's per-table. So would it be triggered
by ANALYZE on any of the joined tables, or just the "main" one? Would it
require a new ANALYZE option, or something else? Not sure.
> - the planner integration strategy,
I don't have a clear opinion on this yet. clauselist_selectivity_ext
seems like the right place, probably. I vaguely recall the challenge was
we either saw the per-relation restriction clauses or join clauses, but
not both at the same time?
> - and scope (what kinds of joins / predicates are worth supporting).
>
IMHO it's fine to keep the scope rather narrow, at least for v1, as long
as we can later expand it to other cases. The extended stats initially
did not support expressions, for example.
It's a bit orthogonal, but do you have an opinion on improving estimates
for some joins by considering per-table extended statistics in the
existing selectivity estimators (e.g. in eqjoinsel_inner)? I think it
might help cases with multi-clause join conditions, or cases where we have
a multi-column MCV list on the join key + WHERE conditions.
It's separate from what this patch aims to do, but I had imagined we'd
do that as the first step, before doing the proper join stats. I'm not
suggesting you have to do that too (or before this patch).
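For example (hypothetical, just to show the query shape I mean; the
estimator changes don't exist today): with a multi-column MCV on the join
key plus a filter column, eqjoinsel_inner could in principle use the
conditional distribution of the join key given the filter:

  -- the per-table statistics object can be created today; teaching
  -- eqjoinsel_inner to use it for the join clause is the missing piece
  CREATE STATISTICS mk_mcv (mcv) ON keyword_id, movie_id
    FROM movie_keyword;
  ANALYZE movie_keyword;

  -- join on mk.keyword_id, filter on mk.movie_id (made-up id); the
  -- multi-column MCV gives the distribution of keyword_id among the
  -- rows passing the filter
  SELECT * FROM movie_keyword mk
  JOIN keyword k ON k.id = mk.keyword_id
  WHERE mk.movie_id = 12345;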
regards
--
Tomas Vondra