Re: Is there value in having optimizer stats for joins/foreignkeys?

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Alexandra Wang <alexandra(dot)wang(dot)oss(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)lists(dot)postgresql(dot)org, hs(at)cybertec(dot)at, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: Is there value in having optimizer stats for joins/foreignkeys?
Date: 2026-02-01 16:48:17
Message-ID: 057ddb97-c5dd-4610-a8bf-845b1bd7dfc4@vondra.me
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 1/30/26 07:26, Corey Huinker wrote:
> I have indeed started by implementing MCV statistics for joins,
> because I have not found a case for joins that would benefit only from
> ndistinct or functional dependency stats that MCV stats wouldn't help.
>
> That was a big question I had, and I agree that we should only add
> statistics as uses for them become apparent.
>  

I don't have a clear opinion on this, and maybe we don't need to build
some of the stats. But I think there's a subtle point here - these stats
may not be useful for estimation of the join itself, but could be useful
for estimating the upper part of the query.

For example, imagine you have a join, with an aggregation on top.

SELECT t1.a, t2.b, COUNT(*) FROM t1 JOIN t2 ON (...) GROUP BY 1, 2;

How would you estimate the number of groups for the aggregate without
having the ndistinct estimate on the join result?

>
> In my POC patch, I've made the following catalog changes:
> - Add *stxotherrel (oid) *and *stxjoinkeys (int2vector)* fields to /
> pg_statistic_ext/
> - Use the existing *stxkeys (int2vector)* to store the stats object
> attributes of /stxotherrel/
> - Create *pg_statistic_ext_otherrel_index* on /(stxrelid, stxotherrel)/
> - Add stxdjoinmcv*(pg_join_mcv_list)* to /pg_statistic_ext_data/
>
>
> I like all these changes. Maybe "outer" rel rather than "other" rel, but
> it really doesn't matter this early on.
>

Already commented on this earlier - I don't think we want to restrict
the joins to 2-way joins. So this "outerrel" thing won't work.

>
> To use them, we can let the planner detect patterns like this:
>
> /*
> * JoinStatsMatch - Information about a detected join pattern
> * Used internally to track what was matched in a join+filter pattern
> */
> typedef struct JoinStatsMatch
> {
> Oid target_rel; /* table OID of the estimation target */
> AttrNumber targetrel_joinkey; /* target_rel's join column */
> Oid other_rel; /* table OID of the filter source */
> AttrNumber otherrel_joinkey; /* other_rel's join column */
> List *filter_attnums; /* list of AttrNumbers for filter columns in
> other_rel */
> List *filter_values; /* list of Datum constant values being filtered */
> Oid collation; /* collation for comparisons */
>
> /* Additional info to avoid duplicate work */
> List *join_rinfos; /* list of join clause RestrictInfos */
> RestrictInfo *filter_rinfo; /* the filter clause RestrictInfo */
> } JoinStatsMatch;
>
> and add the detection logic in clauselist_selectivity_ext() and
> get_foreign_key_join_selectivity(). 
>
> Statistics collection indeed needs the most thinking. For the
> purpose of a POC, I added MCV join stats collection as part of ANALYZE
> of one table (stxrel in pg_statistic_ext). I can do this because MCV
> join stats are somewhat asymmetric. It allows me to have a target
> table (referencing table for foreign key join) to ANALYZE, and we can
> use the already collected MCVs of the joinkey column on the target
> table to query the rows in the other table. This greatly mitigates
> performance impact compared to actually joining two tables. However,
> if we are to support more complex joins or other types of join stats
> such as ndistinct or functional dependency, I found it hard to define
> who's the target table (referencing table) and who's the other table
> (referenced table) outside of the foreign key join scenario. So I
> think for those more complex cases eventually we may as well 
> perform the joins and collect the join stats separately. Alvaro
> Herrera suggested offline that we could have a dedicated autovacuum
> command option for collecting the join statistics.
>
>
> I agree, we have to perform the joins, as the rows collected are
> inherently dependent and skewed, and yes, it's going to be a maintenance
> overhead hit.
>
> Initially I thought this could be mitigated somewhat by retaining
> rowsamples for each table, but those row samples would be independent of
> the joins, and the values we need are inherently dependent.
>  

Joining per-table samples don't work (which doesn't mean having samples
would not be useful, but not for joins). But I already mentioned the
paper I think describes how to build a sample for a join. Now that I
think of it I might have even posted a PoC patch implementing it using
SPI or something like that - but that'd be years ago, at this point.

regards

--
Tomas Vondra

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Marcos Pegoraro 2026-02-01 17:09:16 Re: Document NULL
Previous Message Tomas Vondra 2026-02-01 16:39:38 Re: Is there value in having optimizer stats for joins/foreignkeys?