Quick Links

Re: Is there value in having optimizer stats for joins/foreignkeys?

From:	Tomas Vondra <tomas(at)vondra(dot)me>
To:	Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Alexandra Wang <alexandra(dot)wang(dot)oss(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)lists(dot)postgresql(dot)org, hs(at)cybertec(dot)at, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject:	Re: Is there value in having optimizer stats for joins/foreignkeys?
Date:	2026-02-01 16:32:20
Message-ID:	5c61edbb-1328-4ce9-ab3b-97506b36e068@vondra.me
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 1/30/26 08:29, Corey Huinker wrote:
>
> Current Limitations
> -------------------
>
> This is a proof of concept. Known limitations include:
>
>
> I really like this proof of concept.
>
>
>
> 1. The current catalog design is not ideal. It is asymmetric (a
> "primary" and an "other" table), which is natural for FK-like joins,
> but less intuitive for other joins
>
>
> I think the asymmetry comes with the territory, and we will be creating
> the join statistics that prove useful. If that means that we create one
> object ON b.c1, b.c2 FROM a JOIN b... and another ON a.c3, a.c4 FROM b
> JOIN a... then so be it.
>
>
> .
>
> 2. Stats collection piggybacks on ANALYZE of the primary table and
> uses its single-column MCV for the join key. This can be inaccurate
> when the MCV values on the "primary" side don't cover the important
> values on the other side, or when the filter column isn't fully
> dependent on the join key. A more accurate approach would execute the
> actual join during collection, which could also decouple join stats
> collection from single-table ANALYZE.
>
>
> Unfortunately, I think we will have to join the remote table_b to the
> row sample on table_a to get accurate join statistics, and the best time
> to do that when we already have the row sample from table_a. We can
> further batch up statistics objects that happen to join table_a to
> table_b by the same join criteria to avoid rescans.
>

IIRC the index-based sampling (described in the paper I mentioned) works
something like this - sample leading table, then use indexes to lookup
data from the other tables to build a statistically correct sample of
the whole join.

> Will what you have work when we want to do an MCV on a mix of local and
> remote columns, or will that require more work?
>
>
> 3. Currently limited to: equality join clauses, equality and IN filter
> clauses, simple Var stats objects (no expressions), inner joins only,
> and two-way joins only. Some of these are easier to extend; others may
> be harder or unnecessary (like n-way joins).
>
>
> I suspect that n-way joins would have very limited utility and could be
> adequately covered with multiple join-stats objects.
>

I see no clear reason why that would be the case, and it's not that hard
to construct joins on 3+ tables where stats on 2-way joins won't be good
enough. Whether those cases are sufficiently common I don't know, but I
also don't see a good reason to not address them - it should not be much
more complex than 2-way joins (famous last words, I know).

>
>
> 4. Patch 0002 (auto-creation from FK constraints) should probably be
> gated behind a GUC. I'm not strongly attached to this patch, but kept
> it because FK joins seem like a natural and common use case.
>
>
> I think this is like indexing, just because you can make all possible
> columns indexed doesn't mean you should. Tooling will emerge to
> determine what join stats objects are worth their weight, and create
> only those objects.
>

Agreed.

> If there's interest, I'm happy to continue iterating on the design.
>
> In particular, I'd welcome feedback on:
> - whether this is a direction worth pursuing,
>
>
> Yes. Very much so.
>
>
> - the catalog design,
>
>
> I had somehow gotten the impression that you were going to take the
> extended statistics format, but store individual columns in a modified
> pg_statistic. That's working for now, but I wonder if that will still be
> the case when we start to try this for columns that are arrays, ranges,
> multiranges, tsvectors, etc. pg_statistic has the infrastructure to
> handle those, but there may be good reason to keep pg_statistic focused
> on local attributes and instead just keep adding new kinds to extended
> statistic and let it be the grab-bag it was perhaps always meant to be.
>
> In my own musings on how to implement this (which you have far exceeded
> with this proof-of-concept), I had wondered how to stats for k.keyword
> and k.phonetic_code individually from the definition
> of movie_keywords2_multi_stats, but looking at what you've done I think
> we're better of with defining each statistic object very narrowly, and
> that means we define one object per remote column and one object per
> interesting combination of columns, then so be it. So long as we can
> calculate them all from the same join of the two tables, we'll avoid the
> nasty overhead.
>

Not sure I understand this. Why would this use pg_statistic at all? I
imagined we'd just treat the join as a relation, and store the stats in
pg_statistic_data_ext. The only difference is we need to store info
about the join itself (which rels / conditions), which would go into
pg_statistic_ext.

>
> - and scope (what kinds of joins / predicates are worth supporting).
>
>
> between-ish clauses ( x>= y AND x < z), etc is what immediately comes to
> mind.
>
> element_mcv for array types might be next, but that's well down the road.

Maybe. In v1 we should focus on mimicking what we have for per-relation
stats, and only then try doing something more.

regards

--
Tomas Vondra

In response to

Re: Is there value in having optimizer stats for joins/foreignkeys? at 2026-01-30 07:29:56 from Corey Huinker

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tomas Vondra	2026-02-01 16:39:38	Re: Is there value in having optimizer stats for joins/foreignkeys?
Previous Message	Tomas Vondra	2026-02-01 16:19:04	Re: Is there value in having optimizer stats for joins/foreignkeys?