| From: | Tomas Vondra <tomas(at)vondra(dot)me> |
|---|---|
| To: | Alexandra Wang <alexandra(dot)wang(dot)oss(at)gmail(dot)com> |
| Cc: | jian he <jian(dot)universality(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Andrei Lepikhov <lepihov(at)gmail(dot)com>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, hs(at)cybertec(dot)at, Jeff Davis <pgsql(at)j-davis(dot)com> |
| Subject: | Re: Is there value in having optimizer stats for joins/foreignkeys? |
| Date: | 2026-07-03 11:51:37 |
| Message-ID: | 02dd92ff-b1ee-4e69-9f7c-abc60dfb4142@vondra.me |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi Alexandra,
On 6/23/26 13:02, Alexandra Wang wrote:
> Hi all,
>
> Attached is v8 of the join statistics patch. This revision focuses mainly on
> the catalog-representation feedback from the earlier v7 reviews. Thanks Tom,
> Tomas, Chengpeng, and Corey!
>
> Changes since v7:
>
> 0001: a new preparatory commit (Tom + Tomas feedback). This is a
> standalone prerequisite refactor, independent of join statistics.
> - Removed the stxkeys column from pg_statistic_ext; all target columns
> are now Var nodes in stxexprs alongside the complex expressions.
> - Removed the attnames column from the pg_stats_ext view accordingly.
>
Thanks for the v8 patch. I think 0001 looks pretty good, I only have a
couple minor comments / questions regarding it:
1) Does pg_get_statisticsobjdef_columns naming still make sense? AFAIK
it now returns everything, both "columns" and expressions, right? Maybe
we should get rid of the columns vs. expressions entirely, and just hve
one function showing everything at once?
2) I think the results of queries in perform.sgml are now much harder to
understand, because we're first showing column names and then the jsonb
value for stxddependencies with attnums. And it's not clear how these
two arrays map.
3) system-views.sgml talks about "target columns and expressions" but
that sounds a bit weird to me. I don't think I've heard "target" used to
describe columns an object is defined on, but maybe I'm wrong. Maybe
"columns and expressions the statistics is defined on" would work?
4) pg_get_statisticsobjdef_expressions has this:
/* Skip plain column references (but not virtual generated columns) */
if (IsA(expr, Var) && ((Var *) expr)->varattno > 0 &&
get_attgenerated(statextrec->stxrelid, ((Var *) expr)->varattno)
!= ATTRIBUTE_GENERATED_VIRTUAL)
continue;
but it's not clear to me *why* we need to skip those. I think the
comment should explain that.
5) I see we're losing a FK ...
DECLARE_ARRAY_FOREIGN_KEY((stxrelid, stxkeys), pg_attribute, (attrelid,
attnum));
That's unfortunate, but I guess we still have the dependencies.
Overall, I think 0001 goes in the right direction, and maybe we should
even apply it early, without waiting for the rest of the patch series.
It seems like an improvement.
The main question I have is whether maybe we should get rid of the
columns vs. expressions in more places. Currently the patch removes that
from the catalog, but then it still "restores" this difference when
reading the statistics information, so that both StatExtEntry and
StatisticExtInfo still store this separately. But do we need to?
It means we still need two loops in a lot of places, first over columns
and then over expressions. If we treated everything as expressions,
wouldn't it be simpler?
> Tomas pointed out earlier that representing keys as expressions rather than
> attnums might cost planning time, so I benchmarked 0001 to measure planning
> latency, as below. Build: CFLAGS='-g -O3'.
>
> Table DDLs:
> CREATE TABLE t (a int, b int, c int, d int, e int,
> f int, g int, h int, i int, j int);
> INSERT INTO t SELECT
> mod(i,10), mod(i,13), mod(i,17), mod(i,19), mod(i,23),
> mod(i,29), mod(i,31), mod(i,37), mod(i,41), mod(i,43)
> FROM generate_series(1, 100000) s(i);
>
> Two statistics configurations (all targets are plain columns):
>
> -- 3 stats
> CREATE STATISTICS s1 (mcv) ON a, b, c FROM t;
> CREATE STATISTICS s2 (ndistinct) ON a, b, c, d FROM t;
> CREATE STATISTICS s3 (dependencies) ON a, b, c FROM t;
> ANALYZE t;
>
> -- 10 stats
> CREATE STATISTICS s1 (mcv) ON a, b, c FROM t;
> CREATE STATISTICS s2 (mcv) ON b, c, d FROM t;
> CREATE STATISTICS s3 (mcv) ON c, d, e FROM t;
> CREATE STATISTICS s4 (mcv) ON d, e, f FROM t;
> CREATE STATISTICS s5 (mcv) ON e, f, g FROM t;
> CREATE STATISTICS s6 (mcv) ON f, g, h FROM t;
> CREATE STATISTICS s7 (mcv) ON g, h, i FROM t;
> CREATE STATISTICS s8 (mcv) ON h, i, j FROM t;
> CREATE STATISTICS s9 (ndistinct) ON a, b, c, d, e FROM t;
> CREATE STATISTICS s10 (ndistinct) ON f, g, h, i, j FROM t;
> ANALYZE t;
>
> pgbench query:
>
> EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3;
>
> pgbench command:
>
> pgbench -n -c1 -j1 -T30 -f query.sql
>
> Results:
> Median plan latency, before 0001 (with stxkeys) vs after:
>
> before after
> 3 stats: 0.122 ms 0.123 ms (+0.001 ms)
> 10 stats: 0.121 ms 0.125 ms (+0.004 ms)
>
> So the ~1-4 us per plan overhead at 10 stats objects is negligible.
>
I agree, this looks fine. I wasn't really worried about it, it was just
the one thing that seemed like a possible issue.
> 0002: the main join statistics patch
> - stxkeyrefs is gone (Tom's feedback).
> - The anchor relation's OID is now the first element of stxjoinrels
> (stxjoinrels[0] == stxrelid), so every varno uniformly references
> stxjoinrels[varno-1] (Tom + Tomas feedback).
> - Updated permissions for creating join stats: you must own the anchor
> and have SELECT on the other table(s); that SELECT is rechecked for
> the stats owner at ANALYZE time, and the stats are not refreshed if
> the privilege has been revoked (Tom's feedback).
> - Added documentation and more test coverage.
>
I haven't looked at the 0002 in detail, AFAIK my comments from June 22
still apply.
I did however do some simple experiments, and there's a behavior I don't
quite understand. The attached script creates two correlated tables (the
non-join columns match exactly).
create table t1 (a int, b int, c int);
create table t2 (d int, e int, f int);
insert into t1
select i, mod(i,100), mod(i,100) from generate_series(1,1000000) S(i);
insert into t2
select i, mod(i,100), mod(i,100) from generate_series(1,1000000) S(i);
And then it runs queries like this:
select * from t1 join t2 on t1.a = t2.d where t1.b < X and t2.e < X
with X between 1 and 10, without/with extended statistics on (b,c,e,f).
First with default_statistics_target, then with target 10000 (maximum).
The results look like this:
val | actual | no ext stats | target=100 | target=10000
-----+--------+---------------+------------+-------------
1 | 10000 | 104 | 9700 | 10000
2 | 20000 | 393 | 9700 | 10000
3 | 30000 | 894 | 10100 | 10000
4 | 40000 | 1637 | 11500 | 10000
5 | 50000 | 2577 | 11300 | 10000
6 | 60000 | 3755 | 11000 | 10000
7 | 70000 | 5041 | 8800 | 10000
8 | 80000 | 6531 | 10900 | 10000
9 | 90000 | 8314 | 9200 | 10000
10 | 100000 | 10238 | 10217 | 10000
The "no extended stats" results are not great, but that's expected. But
the results with extended stats are a bit weird.
First, with target=100 the estimates go up and down, which seems a bit
surprising, as we're only increasing the fraction of rows (and of the
MCV) matched by the conditions. Perhaps the MCV in incomplete, and this
noise is due to which entries we end up picking for the MCV, and what
fraction we happen to match? It does seem to oscillate around 10k.
With target 10000, we use the whole join to build the MCV, so it's
complete. And indeed, there's no noise - it always estimates 10k. But
that's strange, isn't it? (For larger values of the query parameters the
estimates start growing, but it matches the "no stats" estimates.)
I haven't figured out why exactly this happens, but AFAICS it's due this
block in join_mcv_clause_selectivity:
if (covered_anchor_sel > 0 && other_rel->tuples > 0)
{
double other_denom
= Max(other_rel->tuples * covered_other_sel, 1.0);
raw_sel /= covered_anchor_sel * other_denom;
CLAMP_PROBABILITY(raw_sel);
}
So maybe it's some thinko in how it applies the anchor?
I noticed a couple more details for 0002:
- The automated statistics naming seems to not recognize plain columns
anymore, so it picks t1_expr_expr_expr_expr_stat even though the
statitstics object is on (b, c, e, f). I guess it's due to 0001.
- I didn't realize we require the user to pick the "anchor" table and
put it first in the CREATE STATISTICS command:
CREATE INDEX ON t1 (a);
CREATE STATISTICS (mcv) on t1.b, t1.c, t2.e, t2.f from t1 join t2
ON (t1.a = t2.d);
ERROR: no suitable index on "t2" column "d" for join statistics
HINT: Create an index on the join column to enable index-based join
sampling.
In this example both t1 and t2 could have be an anchor table. So maybe
we could try determining the anchor table automatically? But maybe there
are issues, e.g. what if there are multiple candidates?
- It's a bit inconvenient the statistics is listed in \d only for the
anchor table. It'd be good to list it for all tables, but maybe add an
info whether it's the anchor or not?
> A few questions on where to take this next:
>
> 1. N-way joins. Tomas, you made the case that 2-way isn't enough (a
> fact table joining two correlated dimensions). The catalog is already
> prepared for n-way, but ANALYZE skips collection with a warning.
> Should extending collection to n-way be the next piece, or would you
> rather see the 2-way version land first?
>
I don't think n-way joins have to be supported from the beginning, but
it would be good to have an experimental patch doing that in the patch
series. In my experience it helps with getting the infrastructure ready
for supporting it later.
> 2. Other statistics kinds such as ndistinct and functional dependencies.
> Are those other stats kinds must haves in the initial commit?
>
I think this is similar. I think it's fine to not support all these
statistical kinds in the initial commit, but it'd help to have some sort
of experimental patch.
> 3. Index-requirement semantics (Chengpeng, Tom, Tomas). Currently a
> suitable index on the probed join column is required: CREATE
> STATISTICS errors if none exists, and a NORMAL dependency on the
> chosen index makes DROP INDEX require CASCADE (so the stats can't
> silently stop building). ANALYZE re-selects the best available index
> on each run. Is this what we want? The alternatives I considered:
> a. Drop the index requirement and add a non-index (sequential-scan)
> sampling fallback. This is the most user-friendly, and not necessarily
> slower when the joined tables are small.
> b. Pin the specific index by storing its OID in the catalog. This
> makes the dependency exact, but I'm unsure how it should interact with
> REINDEX: REINDEX CONCURRENTLY swaps the index to a new OID — the
> NORMAL dependency follows automatically, but a stored OID would need
> to be updated explicitly.
> c. Block DROP INDEX only when it would remove the last suitable
> index (allowing it while an equivalent remains). Also user-friendly,
> but I haven't found a clean way to implement it.
>
> I'm currently inclined either to keep the present behavior for v1, or
> to go with (a) and add the seq-scan fallback. Would appreciate your
> thoughts.
>
Not sure. I think both (a) and (c) would be OK. I'd probably go with
some version of (a), i.e. pick a suitable index at ANALYZE time, and
either "fail" when there's no index or use a seqscan sampling.
regards
--
Tomas Vondra
| Attachment | Content-Type | Size |
|---|---|---|
| stats.sql | application/sql | 1.6 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Nisha Moond | 2026-07-03 11:52:40 | Re: Proposal: Conflict log history table for Logical Replication |
| Previous Message | Alvaro Herrera | 2026-07-03 11:12:39 | Re: REPACK CONCURRENTLY fails on tables with generated columns |