| From: | Tomas Vondra <tomas(at)vondra(dot)me> |
|---|---|
| To: | Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Alexandra Wang <alexandra(dot)wang(dot)oss(at)gmail(dot)com> |
| Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)lists(dot)postgresql(dot)org, hs(at)cybertec(dot)at, Jeff Davis <pgsql(at)j-davis(dot)com> |
| Subject: | Re: Is there value in having optimizer stats for joins/foreignkeys? |
| Date: | 2026-02-01 16:32:20 |
| Message-ID: | 5c61edbb-1328-4ce9-ab3b-97506b36e068@vondra.me |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On 1/30/26 08:29, Corey Huinker wrote:
>
> Current Limitations
> -------------------
>
> This is a proof of concept. Known limitations include:
>
>
> I really like this proof of concept.
>
>
>
> 1. The current catalog design is not ideal. It is asymmetric (a
> "primary" and an "other" table), which is natural for FK-like joins,
> but less intuitive for other joins
>
>
> I think the asymmetry comes with the territory, and we will be creating
> the join statistics that prove useful. If that means that we create one
> object ON b.c1, b.c2 FROM a JOIN b... and another ON a.c3, a.c4 FROM b
> JOIN a... then so be it.
>
>
> .
>
> 2. Stats collection piggybacks on ANALYZE of the primary table and
> uses its single-column MCV for the join key. This can be inaccurate
> when the MCV values on the "primary" side don't cover the important
> values on the other side, or when the filter column isn't fully
> dependent on the join key. A more accurate approach would execute the
> actual join during collection, which could also decouple join stats
> collection from single-table ANALYZE.
>
>
> Unfortunately, I think we will have to join the remote table_b to the
> row sample on table_a to get accurate join statistics, and the best time
> to do that when we already have the row sample from table_a. We can
> further batch up statistics objects that happen to join table_a to
> table_b by the same join criteria to avoid rescans.
>
IIRC the index-based sampling (described in the paper I mentioned) works
something like this - sample leading table, then use indexes to lookup
data from the other tables to build a statistically correct sample of
the whole join.
> Will what you have work when we want to do an MCV on a mix of local and
> remote columns, or will that require more work?
>
>
> 3. Currently limited to: equality join clauses, equality and IN filter
> clauses, simple Var stats objects (no expressions), inner joins only,
> and two-way joins only. Some of these are easier to extend; others may
> be harder or unnecessary (like n-way joins).
>
>
> I suspect that n-way joins would have very limited utility and could be
> adequately covered with multiple join-stats objects.
>
I see no clear reason why that would be the case, and it's not that hard
to construct joins on 3+ tables where stats on 2-way joins won't be good
enough. Whether those cases are sufficiently common I don't know, but I
also don't see a good reason to not address them - it should not be much
more complex than 2-way joins (famous last words, I know).
>
>
> 4. Patch 0002 (auto-creation from FK constraints) should probably be
> gated behind a GUC. I'm not strongly attached to this patch, but kept
> it because FK joins seem like a natural and common use case.
>
>
> I think this is like indexing, just because you can make all possible
> columns indexed doesn't mean you should. Tooling will emerge to
> determine what join stats objects are worth their weight, and create
> only those objects.
>
Agreed.
> If there's interest, I'm happy to continue iterating on the design.
>
> In particular, I'd welcome feedback on:
> - whether this is a direction worth pursuing,
>
>
> Yes. Very much so.
>
>
> - the catalog design,
>
>
> I had somehow gotten the impression that you were going to take the
> extended statistics format, but store individual columns in a modified
> pg_statistic. That's working for now, but I wonder if that will still be
> the case when we start to try this for columns that are arrays, ranges,
> multiranges, tsvectors, etc. pg_statistic has the infrastructure to
> handle those, but there may be good reason to keep pg_statistic focused
> on local attributes and instead just keep adding new kinds to extended
> statistic and let it be the grab-bag it was perhaps always meant to be.
>
> In my own musings on how to implement this (which you have far exceeded
> with this proof-of-concept), I had wondered how to stats for k.keyword
> and k.phonetic_code individually from the definition
> of movie_keywords2_multi_stats, but looking at what you've done I think
> we're better of with defining each statistic object very narrowly, and
> that means we define one object per remote column and one object per
> interesting combination of columns, then so be it. So long as we can
> calculate them all from the same join of the two tables, we'll avoid the
> nasty overhead.
>
Not sure I understand this. Why would this use pg_statistic at all? I
imagined we'd just treat the join as a relation, and store the stats in
pg_statistic_data_ext. The only difference is we need to store info
about the join itself (which rels / conditions), which would go into
pg_statistic_ext.
>
> - and scope (what kinds of joins / predicates are worth supporting).
>
>
> between-ish clauses ( x>= y AND x < z), etc is what immediately comes to
> mind.
>
> element_mcv for array types might be next, but that's well down the road.
Maybe. In v1 we should focus on mimicking what we have for per-relation
stats, and only then try doing something more.
regards
--
Tomas Vondra
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Tomas Vondra | 2026-02-01 16:39:38 | Re: Is there value in having optimizer stats for joins/foreignkeys? |
| Previous Message | Tomas Vondra | 2026-02-01 16:19:04 | Re: Is there value in having optimizer stats for joins/foreignkeys? |