Quick Links

Re: Is there value in having optimizer stats for joins/foreignkeys?

From:	Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To:	Alexandra Wang <alexandra(dot)wang(dot)oss(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(at)vondra(dot)me>, pgsql-hackers(at)lists(dot)postgresql(dot)org, hs(at)cybertec(dot)at, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject:	Re: Is there value in having optimizer stats for joins/foreignkeys?
Date:	2026-01-30 06:26:22
Message-ID:	CADkLM=fRCC3m3DJPPgYEHOJy-+932Q4RMm3Vqz7BQ_JPFASPWg@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

>
> I have indeed started by implementing MCV statistics for joins,
> because I have not found a case for joins that would benefit only from
> ndistinct or functional dependency stats that MCV stats wouldn't help.
>

That was a big question I had, and I agree that we should only add
statistics as uses for them become apparent.

>
> In my POC patch, I've made the following catalog changes:
> - Add *stxotherrel (oid) *and *stxjoinkeys (int2vector)* fields to
> *pg_statistic_ext*
> - Use the existing *stxkeys (int2vector)* to store the stats object
> attributes of *stxotherrel*
> - Create *pg_statistic_ext_otherrel_index* on *(stxrelid, stxotherrel)*
> - Add stxdjoinmcv* (pg_join_mcv_list)* to *pg_statistic_ext_data*
>

I like all these changes. Maybe "outer" rel rather than "other" rel, but it
really doesn't matter this early on.

>
> To use them, we can let the planner detect patterns like this:
>
> /*
> * JoinStatsMatch - Information about a detected join pattern
> * Used internally to track what was matched in a join+filter pattern
> */
> typedef struct JoinStatsMatch
> {
> Oid target_rel; /* table OID of the estimation target */
> AttrNumber targetrel_joinkey; /* target_rel's join column */
> Oid other_rel; /* table OID of the filter source */
> AttrNumber otherrel_joinkey; /* other_rel's join column */
> List *filter_attnums; /* list of AttrNumbers for filter columns in other_rel */
> List *filter_values; /* list of Datum constant values being filtered */
> Oid collation; /* collation for comparisons */
>
> /* Additional info to avoid duplicate work */
> List *join_rinfos; /* list of join clause RestrictInfos */
> RestrictInfo *filter_rinfo; /* the filter clause RestrictInfo */
> } JoinStatsMatch;
>
> and add the detection logic in clauselist_selectivity_ext() and
> get_foreign_key_join_selectivity().
>
> Statistics collection indeed needs the most thinking. For the
> purpose of a POC, I added MCV join stats collection as part of ANALYZE
> of one table (stxrel in pg_statistic_ext). I can do this because MCV
> join stats are somewhat asymmetric. It allows me to have a target
> table (referencing table for foreign key join) to ANALYZE, and we can
> use the already collected MCVs of the joinkey column on the target
> table to query the rows in the other table. This greatly mitigates
> performance impact compared to actually joining two tables. However,
> if we are to support more complex joins or other types of join stats
> such as ndistinct or functional dependency, I found it hard to define
> who's the target table (referencing table) and who's the other table
> (referenced table) outside of the foreign key join scenario. So I
> think for those more complex cases eventually we may as well
> perform the joins and collect the join stats separately. Alvaro
> Herrera suggested offline that we could have a dedicated autovacuum
> command option for collecting the join statistics.
>

I agree, we have to perform the joins, as the rows collected are inherently
dependent and skewed, and yes, it's going to be a maintenance overhead hit.

Initially I thought this could be mitigated somewhat by retaining
rowsamples for each table, but those row samples would be independent of
the joins, and the values we need are inherently dependent.

> I have experimented with two ways to define the join statistics:
>
> 1. Use CREATE STATISTICS:
>
> CREATE STATISTICS [ [ IF NOT EXISTS ] statistics_name ] [ ( mcv ) ] ON {
> table_name1.column_name1 }, { table_name1.column_name2 } [, ...] FROM
> table_name1 JOIN table_name2 ON table_name1.column_name3 =
> table_name2.column_name4
>

We'll need to support aliases because there could be a self-join :(

>
> Examples:
> -- Create join MCV statistics on a single filter column (keyword)
> CREATE STATISTICS movie_keyword_keyword_join_stats (mcv)
> ON k.keyword
> FROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id);
> ANALYZE movie_keyword;
>
> -- Create join MCV statistics on multiple filter columns (keyword +
> phonetic_code):
> CREATE STATISTICS movie_keyword_keyword_multicols_join_stats (mcv)
> ON k.keyword, k.phonetic_code
> FROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id);
> ANALYZE movie_keyword;
>

This is where the existing CREATE STATISTICS syntax does not serve our
purposes well. We definitely want MCV stats for both of those k.* columns
with the skew of values that join to move_keyword on that defined foreign
key, but we'd end up getting the _combinations_ of keyword, phonetic_code,
which we don't necessarily care about.

We might want an alternate syntax

CREATE STATISTICS movie_keyword_keyword_multicols_join_stats (mcv)
ON keyword, phonetic_code
[FROM movie_keyword]
USING movie_keyword_fk;

In this case, the FROM clause is redundant and therefore optional, the
columns listed must exist on the confrelid and the object is keyed to the
conrelid. Having said that, I think people don't really use constraint
names and so the join syntax will likely be used more often.

2. Auto join stats creation for Foreign Key + Functional Dependency stats
>
> Initially, I did not implement the CREATE TABLE STATISTICS command to
> create the join stats. Instead, I’ve implemented logic in ANALYZE to
> detect functional dependency stats on the referenced table through FKs
> and create join statistics implicitly for those cases.
>

I'm not excited about this, and others have expressed concern that it would
lead to an explosion of mediocre statistics objects.

In response to

Re: Is there value in having optimizer stats for joins/foreignkeys? at 2025-12-03 19:01:13 from Alexandra Wang

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Dilip Kumar	2026-01-30 06:35:59	Re: Proposal: Conflict log history table for Logical Replication
Previous Message	Xuneng Zhou	2026-01-30 06:01:09	Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery