Re: Is there value in having optimizer stats for joins/foreignkeys?

From: Alexandra Wang <alexandra(dot)wang(dot)oss(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: jian he <jian(dot)universality(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Andrei Lepikhov <lepihov(at)gmail(dot)com>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, hs(at)cybertec(dot)at, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: Is there value in having optimizer stats for joins/foreignkeys?
Date: 2026-06-23 11:02:41
Message-ID: CAK98qZ2tRpG=H0V=wa7bd5_3KoqFdH+JL2jG7DvP3W3jYTQ=rw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi all,

Attached is v8 of the join statistics patch. This revision focuses mainly on
the catalog-representation feedback from the earlier v7 reviews. Thanks Tom,
Tomas, Chengpeng, and Corey!

Changes since v7:

0001: a new preparatory commit (Tom + Tomas feedback). This is a
standalone prerequisite refactor, independent of join statistics.
- Removed the stxkeys column from pg_statistic_ext; all target columns
are now Var nodes in stxexprs alongside the complex expressions.
- Removed the attnames column from the pg_stats_ext view accordingly.

Tomas pointed out earlier that representing keys as expressions rather than
attnums might cost planning time, so I benchmarked 0001 to measure planning
latency, as below. Build: CFLAGS='-g -O3'.

Table DDLs:
CREATE TABLE t (a int, b int, c int, d int, e int,
f int, g int, h int, i int, j int);
INSERT INTO t SELECT
mod(i,10), mod(i,13), mod(i,17), mod(i,19), mod(i,23),
mod(i,29), mod(i,31), mod(i,37), mod(i,41), mod(i,43)
FROM generate_series(1, 100000) s(i);

Two statistics configurations (all targets are plain columns):

-- 3 stats
CREATE STATISTICS s1 (mcv) ON a, b, c FROM t;
CREATE STATISTICS s2 (ndistinct) ON a, b, c, d FROM t;
CREATE STATISTICS s3 (dependencies) ON a, b, c FROM t;
ANALYZE t;

-- 10 stats
CREATE STATISTICS s1 (mcv) ON a, b, c FROM t;
CREATE STATISTICS s2 (mcv) ON b, c, d FROM t;
CREATE STATISTICS s3 (mcv) ON c, d, e FROM t;
CREATE STATISTICS s4 (mcv) ON d, e, f FROM t;
CREATE STATISTICS s5 (mcv) ON e, f, g FROM t;
CREATE STATISTICS s6 (mcv) ON f, g, h FROM t;
CREATE STATISTICS s7 (mcv) ON g, h, i FROM t;
CREATE STATISTICS s8 (mcv) ON h, i, j FROM t;
CREATE STATISTICS s9 (ndistinct) ON a, b, c, d, e FROM t;
CREATE STATISTICS s10 (ndistinct) ON f, g, h, i, j FROM t;
ANALYZE t;

pgbench query:

EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3;

pgbench command:

pgbench -n -c1 -j1 -T30 -f query.sql

Results:
Median plan latency, before 0001 (with stxkeys) vs after:

before after
3 stats: 0.122 ms 0.123 ms (+0.001 ms)
10 stats: 0.121 ms 0.125 ms (+0.004 ms)

So the ~1-4 us per plan overhead at 10 stats objects is negligible.

0002: the main join statistics patch
- stxkeyrefs is gone (Tom's feedback).
- The anchor relation's OID is now the first element of stxjoinrels
(stxjoinrels[0] == stxrelid), so every varno uniformly references
stxjoinrels[varno-1] (Tom + Tomas feedback).
- Updated permissions for creating join stats: you must own the anchor
and have SELECT on the other table(s); that SELECT is rechecked for
the stats owner at ANALYZE time, and the stats are not refreshed if
the privilege has been revoked (Tom's feedback).
- Added documentation and more test coverage.

A few questions on where to take this next:

1. N-way joins. Tomas, you made the case that 2-way isn't enough (a
fact table joining two correlated dimensions). The catalog is already
prepared for n-way, but ANALYZE skips collection with a warning.
Should extending collection to n-way be the next piece, or would you
rather see the 2-way version land first?

2. Other statistics kinds such as ndistinct and functional dependencies.
Are those other stats kinds must haves in the initial commit?

3. Index-requirement semantics (Chengpeng, Tom, Tomas). Currently a
suitable index on the probed join column is required: CREATE
STATISTICS errors if none exists, and a NORMAL dependency on the
chosen index makes DROP INDEX require CASCADE (so the stats can't
silently stop building). ANALYZE re-selects the best available index
on each run. Is this what we want? The alternatives I considered:
a. Drop the index requirement and add a non-index (sequential-scan)
sampling fallback. This is the most user-friendly, and not necessarily
slower when the joined tables are small.
b. Pin the specific index by storing its OID in the catalog. This
makes the dependency exact, but I'm unsure how it should interact with
REINDEX: REINDEX CONCURRENTLY swaps the index to a new OID — the
NORMAL dependency follows automatically, but a stored OID would need
to be updated explicitly.
c. Block DROP INDEX only when it would remove the last suitable
index (allowing it while an equivalent remains). Also user-friendly,
but I haven't found a clean way to implement it.

I'm currently inclined either to keep the present behavior for v1, or
to go with (a) and add the seq-scan fallback. Would appreciate your
thoughts.

On Mon, Jun 22, 2026 at 12:03 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:

> I finally had a bit of time to take a look at this again today. AFAIK
> nothing was posted since pgconf.dev last month, so I took a look at the
> v7 to do at least something. I assume you may have an updated patch, but
> chances are it's not too different.
>

Thank you so much, Tomas, for the detailed review. I was just about to
post v8 when your feedback arrived. I addressed some of your most
recent feedback (#9, #12, #15). I didn't want you reviewing stale
code, so I'm posting what I have now and will follow up on the rest.

Best,
Alex

--
Alexandra Wang
EDB: https://www.enterprisedb.com

Attachment Content-Type Size
v8-0001-Remove-stxkeys-from-pg_statistic_ext-unify-into-s.patch application/octet-stream 79.0 KB
v8-0002-Add-join-MCV-statistics-for-selectivity-estimatio.patch application/octet-stream 250.7 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Euler Taveira 2026-06-23 11:35:42 Re: doc: should pg_createsubscriber be grouped as a client application?
Previous Message Jim Jones 2026-06-23 11:02:34 Re: Add PRODUCT() aggregate function