Re: Is there value in having optimizer stats for joins/foreignkeys?

From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: Alexandra Wang <alexandra(dot)wang(dot)oss(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(at)vondra(dot)me>, pgsql-hackers(at)lists(dot)postgresql(dot)org, hs(at)cybertec(dot)at, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: Is there value in having optimizer stats for joins/foreignkeys?
Date: 2026-01-30 07:29:56
Message-ID: CADkLM=eAc5DCYUh4fc7eToMuPdNA7B_9fthXDsR8QVNZyuo+Dw@mail.gmail.com
Lists: pgsql-hackers

>
>
> Current Limitations
> -------------------
>
> This is a proof of concept. Known limitations include:
>

I really like this proof of concept.

> 1. The current catalog design is not ideal. It is asymmetric (a
> "primary" and an "other" table), which is natural for FK-like joins,
> but less intuitive for other joins
>

I think the asymmetry comes with the territory, and we will be creating the
join statistics that prove useful. If that means that we create one object
ON b.c1, b.c2 FROM a JOIN b... and another ON a.c3, a.c4 FROM b JOIN a...
then so be it.
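
Just so we're picturing the same thing, here is a rough sketch of that pair
of objects. The statistics names and the a.id = b.a_id join key are invented
for illustration, and I haven't checked any of this against your actual
grammar:

    -- hypothetical syntax: stats on b's columns, keyed by the a-to-b join
    CREATE STATISTICS a_join_b_stats
        ON b.c1, b.c2
        FROM a JOIN b ON a.id = b.a_id;

    -- and the mirror-image object going the other direction
    CREATE STATISTICS b_join_a_stats
        ON a.c3, a.c4
        FROM b JOIN a ON b.a_id = a.id;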

>
> 2. Stats collection piggybacks on ANALYZE of the primary table and
> uses its single-column MCV for the join key. This can be inaccurate
> when the MCV values on the "primary" side don't cover the important
> values on the other side, or when the filter column isn't fully
> dependent on the join key. A more accurate approach would execute the
> actual join during collection, which could also decouple join stats
> collection from single-table ANALYZE.
>

Unfortunately, I think we will have to join the remote table_b to the row
sample on table_a to get accurate join statistics, and the best time to do
that is when we already have the row sample from table_a in hand. We can
further batch up statistics objects that happen to join table_a to table_b
by the same join criteria to avoid rescans.
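
For concreteness, the shape of query I imagine that collection step issuing
is something like the following. The table and column names are invented,
and TABLESAMPLE here is just a stand-in for reusing the row sample that
ANALYZE has already acquired:

    -- one pass over the sampled rows of table_a joined to table_b;
    -- the grouped counts would feed a join MCV on (b.c1, b.c2)
    SELECT b.c1, b.c2, count(*) AS join_freq
    FROM (SELECT * FROM table_a TABLESAMPLE SYSTEM (1)) a
    JOIN table_b b ON b.a_id = a.id
    GROUP BY b.c1, b.c2
    ORDER BY join_freq DESC
    LIMIT 100;

Any other statistics objects that share the same join criteria could then be
derived from that one scan.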

Will what you have work when we want to do an MCV on a mix of local and
remote columns, or will that require more work?

> 3. Currently limited to: equality join clauses, equality and IN filter
> clauses, simple Var stats objects (no expressions), inner joins only,
> and two-way joins only. Some of these are easier to extend; others may
> be harder or unnecessary (like n-way joins).
>

I suspect that n-way joins would have very limited utility and could be
adequately covered with multiple join-stats objects.

>
> 4. Patch 0002 (auto-creation from FK constraints) should probably be
> gated behind a GUC. I'm not strongly attached to this patch, but kept
> it because FK joins seem like a natural and common use case.
>

I think this is like indexing: just because you can index every possible
column doesn't mean you should. Tooling will emerge to determine which join
stats objects are worth their weight, and create only those objects.

> If there's interest, I'm happy to continue iterating on the design.
>
> In particular, I'd welcome feedback on:
> - whether this is a direction worth pursuing,
>

Yes. Very much so.

> - the catalog design,
>

I had somehow gotten the impression that you were going to take the
extended statistics format, but store individual columns in a modified
pg_statistic. That's working for now, but I wonder if that will still be
the case when we start to try this for columns that are arrays, ranges,
multiranges, tsvectors, etc. pg_statistic has the infrastructure to handle
those, but there may be good reason to keep pg_statistic focused on local
attributes and instead just keep adding new kinds to extended statistics and
let it be the grab-bag it was perhaps always meant to be.

In my own musings on how to implement this (which you have far exceeded
with this proof of concept), I had wondered how to derive stats for
k.keyword and k.phonetic_code individually from the definition of
movie_keywords2_multi_stats. But looking at what you've done, I think we're
better off defining each statistics object very narrowly, and if that means
we define one object per remote column and one object per interesting
combination of columns, then so be it. So long as we can calculate them all
from the same join of the two tables, we'll avoid the nasty overhead.
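
To make that concrete, I'm imagining something along these lines, where the
table names, the statistics names, and the join key are all invented rather
than taken from your patch:

    -- one narrow object per remote column, plus one for the combination,
    -- all sharing the same join so they can be computed in a single pass
    CREATE STATISTICS mk_keyword_stats
        ON k.keyword
        FROM movie_keywords2 mk JOIN keyword k ON mk.keyword_id = k.id;

    CREATE STATISTICS mk_phonetic_stats
        ON k.phonetic_code
        FROM movie_keywords2 mk JOIN keyword k ON mk.keyword_id = k.id;

    CREATE STATISTICS mk_keyword_phonetic_stats
        ON k.keyword, k.phonetic_code
        FROM movie_keywords2 mk JOIN keyword k ON mk.keyword_id = k.id;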

> - and scope (what kinds of joins / predicates are worth supporting).
>

Between-ish clauses (x >= y AND x < z), etc., are what immediately come to
mind.
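
By "between-ish" I mean join clauses along the lines of this (column names
invented):

    -- a band/range join rather than a plain equality join
    SELECT count(*)
    FROM fact f
    JOIN calendar c
      ON f.event_ts >= c.start_ts
     AND f.event_ts <  c.end_ts;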

element_mcv for array types might be next, but that's well down the road.
