Re: Is there value in having optimizer stats for joins/foreignkeys?

From: Alexandra Wang <alexandra(dot)wang(dot)oss(at)gmail(dot)com>
To: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(at)vondra(dot)me>, pgsql-hackers(at)lists(dot)postgresql(dot)org, hs(at)cybertec(dot)at, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: Is there value in having optimizer stats for joins/foreignkeys?
Date: 2026-01-29 05:04:19
Message-ID: CAK98qZ0LwJbUoiZjjFXitojHy4UskkjYDiSd_JZfGE9LbfZm9w@mail.gmail.com
Lists: pgsql-hackers

Hi hackers,

As promised in my previous email, I'm sharing a proof-of-concept patch
exploring join statistics for correlated columns across relations.
This is a POC at this point, but I hope the performance numbers below
give a better idea of both the potential usefulness of join statistics
and the complexity of implementing them.

Join Order Benchmark (JOB)
---------------------------------------

I got interested in this work after reading a 2015 VLDB paper [1] that
introduced the Join Order Benchmark (JOB) [2]. JOB uses the IMDB
dataset, with queries having 3-16 joins (average 8). Unlike standard
benchmarks such as TPC-H, which use randomly generated, uniform, and
independent data, JOB uses real-world data with correlations and
skewed distributions, which make cardinality estimation much harder.

The paper tested PostgreSQL among other databases. I reran the
benchmark on current PG 19devel to see where we stand today. (The
original data source is gone, but CedarDB hosts a mirror [3]; I
created a fork [4] and added some scripts. The schema DDLs and queries
are unchanged.)

Test setup:
MacBook, Apple M3 Pro
Postgres master branch (19devel), built with -O3
cold and warm runs, 3 runs each

Results (average of 3 runs):

Postgres 19devel branch:
Total execution time cold run: 140.75 s (±2.12 s)
Total execution time warm run: 117.72 s (±0.76 s)

Postgres 19devel branch, SET enable_nestloop = off:
Total execution time cold run: 87.16 s (±0.17 s)
Total execution time warm run: 86.68 s (±0.42 s)

These results suggest that the planner is frequently choosing nested
loop joins where hash or merge joins would be better.

(As a side note: with nestloops disabled, plans are dominated by hash
joins, which scan their inputs once and benefit little from a warm
cache, so cold and warm runs are very similar.)

Problem
--------

The slowest query in this workload is 16b.sql (attached). Its plan
contains 7 nested loop joins.

The innermost join is estimated as 34 rows but actually produces
41,840 rows — a ~1230x underestimation:

Nested Loop  (cost=6.80..3813.14 rows=34 width=4) (actual time=2.336..304.160 rows=41840 loops=1)
  ->  Seq Scan on keyword k  (cost=0.00..2685.11 rows=1 width=4) (actual time=0.402..6.161 rows=1.00 loops=1)
        Filter: (keyword = 'character-name-in-title')
  ->  Bitmap Heap Scan on movie_keyword mk  (cost=6.80..1124.98 rows=305 width=8) (actual time=1.933..295.493 rows=41840.00 loops=1)
        Recheck Cond: (k.id = keyword_id)

This underestimation cascades to the outermost join: ~3K estimated vs
~3.7M actual rows.

The problem can be reduced to the following query:

SELECT * FROM movie_keyword mk, keyword k
WHERE k.keyword = 'character-name-in-title'
  AND k.id = mk.keyword_id;

The tables:

keyword(id PK, keyword, phonetic_code)
movie_keyword(id PK, movie_id, keyword_id)

These tables are strongly correlated via keyword_id -> keyword.id. We
can even add:

ALTER TABLE movie_keyword
  ADD CONSTRAINT movie_keyword_keyword_id_fkey
  FOREIGN KEY (keyword_id) REFERENCES keyword(id);
CREATE STATISTICS keyword_stats (dependencies)
  ON id, keyword FROM keyword;
ANALYZE keyword, movie_keyword;

However, even with the FK constraint and extended statistics on the
keyword table, the estimate remains unchanged:

SELECT * FROM check_estimated_rows(
  'SELECT * FROM movie_keyword mk, keyword k
   WHERE k.keyword = ''character-name-in-title''
     AND k.id = mk.keyword_id');

 estimated | actual
-----------+--------
        34 |  41840

This indicates that the planner cannot connect the skewed
distribution of the join key in the referencing table with the
selectivity that the filter on the referenced table has over the join
result. This is the gap that join statistics aim to fill.

Proposed Solution
------------------

I've attached two patches:

0001: Extend CREATE STATISTICS for join MCV statistics
0002: Automatically create join MCV statistics from FK constraints

0001 is the core patch. 0002 is an optional convenience feature
that detects FK relationships during ANALYZE and auto-creates
join stats.

Syntax:

CREATE STATISTICS <name> (mcv)
  ON <other_alias>.<filter_col> [, ...]
  FROM <primary_table> <primary_alias>
  JOIN <other_table> <other_alias> ON (<join_condition>);

Example -- single filter column:

CREATE STATISTICS mk_keyword_stats (mcv)
  ON k.keyword
  FROM movie_keyword mk
  JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;

Example -- multiple filter columns:

CREATE STATISTICS mk_multi_stats (mcv)
  ON k.keyword, k.phonetic_code
  FROM movie_keyword mk
  JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;

Catalog changes:

pg_statistic_ext:
- New stats kind 'c' (join MCV) in stxkind
- New field stxotherrel (Oid): the other table in the join
- New field stxjoinkeys (int2vector): join column pair [primary_joinkey,
other_joinkey]
- Existing stxkeys stores the filter column(s) on stxotherrel
- New index pg_statistic_ext_otherrel_index on (stxrelid, stxotherrel)

pg_statistic_ext_data:
- New field stxdjoinmcv (pg_join_mcv_list): serialized join MCV data
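
As a sketch, the join stats objects and the new columns can be
inspected with a query along these lines (using only the columns
named above, as added by 0001; 'c' marks join MCV stats):

SELECT stxname, stxrelid::regclass AS rel,
       stxotherrel::regclass AS otherrel,
       stxjoinkeys, stxkeys, stxkind
FROM pg_statistic_ext
WHERE 'c' = ANY (stxkind);  -- only join MCV stats objects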

How stats are collected:

Join statistics are collected during ANALYZE of the primary table
(stxrelid). The current approach assumes a dependency relationship
between the join key column and the filter column(s) on the other
table. Specifically, during ANALYZE, we reuse the already-computed
single-column MCV stats for the primary table's join key column. For
each MCV value, we look up the matching row in the other table via the
join key and extract the filter column values, carrying over the
primary-side MCV frequency. This is not accurate in all cases, but
works reasonably well for the POC, especially for foreign-key-like
joins where the primary table's join key distribution is
representative of the join result.
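
Roughly, in SQL terms, the collection does the equivalent of the
following (illustration only; the patch does this in C over the
in-memory ANALYZE results, not over pg_stats):

-- Approximate the join MCV for (movie_keyword, keyword): take each
-- MCV of the join key on the primary side, fetch the filter column
-- from the other table, and carry over the MCV frequency.
SELECT k.keyword AS mcv_value, m.freq AS frequency
FROM pg_stats s
CROSS JOIN LATERAL unnest(s.most_common_vals::text::int[],
                          s.most_common_freqs) AS m(keyword_id, freq)
JOIN keyword k ON k.id = m.keyword_id
WHERE s.tablename = 'movie_keyword'
  AND s.attname = 'keyword_id';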

For example, the catalog contents look like:

   stxrelid    | stxotherrel | stxjoinkeys | stxkeys | stxkind
---------------+-------------+-------------+---------+---------
 movie_keyword | keyword     | 2 1         | 2       | {c}

 index | values      | nulls | frequency
-------+-------------+-------+-----------
     0 | {keyword_1} | {f}   |      0.06
     1 | {keyword_2} | {f}   |      0.06
     2 | {keyword_3} | {f}   |      0.06
   ... | ...         | ...   |       ...

How the planner uses join stats:

I added the join pattern detection logic in
clauselist_selectivity_ext() and get_foreign_key_join_selectivity().
The planner looks for a pattern of:

- An equijoin clause between two relations (the join key pair), and
- Equality or IN filter clauses on columns of one relation (the filter
columns)

When this pattern is found and matching join MCV stats exist, the
planner compares the filter values against the stored MCV items to
compute the join selectivity, replacing the default estimate.
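
For instance, given the stats object from the earlier example, both
of the following queries match the pattern (the second keyword value
is just an illustrative placeholder):

SELECT count(*)
FROM movie_keyword mk
JOIN keyword k ON mk.keyword_id = k.id
WHERE k.keyword = 'character-name-in-title';

SELECT count(*)
FROM movie_keyword mk
JOIN keyword k ON mk.keyword_id = k.id
WHERE k.keyword IN ('character-name-in-title', 'some-other-keyword');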

JOB Benchmark Results
---------------------

I created a single join statistics object for the (movie_keyword,
keyword) pair and reran the JOB benchmark:

CREATE STATISTICS movie_keyword_keyword_stats (mcv)
ON k.keyword
FROM movie_keyword mk
JOIN keyword k ON (mk.keyword_id = k.id);

Total execution times (average of 3 runs):

                      Cold run           Warm run
Default (baseline)    140.75 s (±2.1)    117.72 s (±0.8)
enable_nestloop=off    87.16 s (±0.2)     86.68 s (±0.4)
With join stats        85.81 s (±3.4)     65.24 s (±0.5)

With a single join statistics object:
- Cold run: -39% vs baseline (140.75s -> 85.81s),
comparable to enable_nestloop=off
- Warm run: -45% vs baseline (117.72s -> 65.24s),
-25% vs enable_nestloop=off (86.68s -> 65.24s)

Per-query times for the 10 slowest (out of 113) queries on baseline
(master branch, enable_nestloop = on):

Query | Baseline | No Nestloop | Join Stats | Best
------+-----------+-------------+------------+----------
16b | 10,673 ms | 2,789 ms | 2,775 ms | 3.8x JS
17e | 7,709 ms | 2,260 ms | 2,149 ms | 3.6x JS
17f | 7,506 ms | 2,304 ms | 2,460 ms | 3.3x NL
17a | 7,288 ms | 1,453 ms | 2,388 ms | 5.0x NL
17b | 6,490 ms | 1,580 ms | 5,413 ms | 4.1x NL
17c | 6,240 ms | 1,095 ms | 5,268 ms | 5.7x NL
17d | 6,261 ms | 1,263 ms | 5,291 ms | 5.0x NL
6d | 6,234 ms | 988 ms | 1,565 ms | 6.3x NL
25c | 6,133 ms | 1,848 ms | 1,738 ms | 3.5x JS
6f | 5,728 ms | 1,785 ms | 1,556 ms | 3.7x JS

(JS = join stats fastest, NL = no-nestloop fastest)

All 10 queries improved over baseline. For 16b, 3 nested loop joins
were replaced with 2 hash joins and 1 merge join. Some queries (e.g.
17b) kept the same plan shape and joins but gained a Memoize node from
better cardinality estimates.

Current Limitations
-------------------

This is a proof of concept. Known limitations include:

1. The current catalog design is not ideal. It is asymmetric (a
"primary" and an "other" table), which is natural for FK-like joins,
but less intuitive for other joins.

2. Stats collection piggybacks on ANALYZE of the primary table and
uses its single-column MCV for the join key. This can be inaccurate
when the MCV values on the "primary" side don't cover the important
values on the other side, or when the filter column isn't fully
dependent on the join key. A more accurate approach would execute the
actual join during collection, which could also decouple join stats
collection from single-table ANALYZE (see the sketch after this
list).

3. Currently limited to: equality join clauses, equality and IN filter
clauses, simple Var stats objects (no expressions), inner joins only,
and two-way joins only. Some of these are easier to extend; others may
be harder or unnecessary (like n-way joins).

4. Patch 0002 (auto-creation from FK constraints) should probably be
gated behind a GUC. I'm not strongly attached to this patch, but kept
it because FK joins seem like a natural and common use case.
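
For item 2, a more accurate collection could conceptually compute the
MCV over the real join result, e.g. over a sample (a sketch of the
idea, not what the patch currently does):

-- True join MCV: frequency of each filter value in the join result.
SELECT k.keyword AS mcv_value,
       count(*)::float8 / sum(count(*)) OVER () AS frequency
FROM movie_keyword mk
JOIN keyword k ON k.id = mk.keyword_id
GROUP BY k.keyword
ORDER BY frequency DESC
LIMIT 100;  -- keep the top-N entries, like a regular MCV list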

Conclusion
----------

Even with all the current limitations, I think this experiment
suggests that join-level statistics can significantly improve plan
quality: here, a single join stats object materially improved
whole-workload performance.

If there's interest, I'm happy to continue iterating on the design.

In particular, I'd welcome feedback on:
- whether this is a direction worth pursuing,
- the catalog design,
- the stats collection approach,
- the planner integration strategy,
- and scope (what kinds of joins / predicates are worth supporting).

Best,
Alex

[1] https://www.vldb.org/pvldb/vol9/p204-leis.pdf
[2] https://github.com/gregrahn/join-order-benchmark
[3] https://cedardb.com/docs/example_datasets/job/
[4] https://github.com/l-wang/join-order-benchmark

--
Alexandra Wang
EDB: https://www.enterprisedb.com

Attachment Content-Type Size
16b_nl.txt text/plain 5.6 KB
v1-0002-Automatically-create-join-MCV-statistics-from-FK-.patch application/octet-stream 33.2 KB
v1-0001-Extend-CREATE-STATISTICS-syntax-for-join-MCV-stat.patch application/octet-stream 127.5 KB
16b.sql application/octet-stream 668 bytes
16b_js.txt text/plain 9.4 KB
16b_baseline.txt text/plain 5.6 KB
