Quick Links

Re: Is there value in having optimizer stats for joins/foreignkeys?

From:	Tomas Vondra <tomas(at)vondra(dot)me>
To:	Alexandra Wang <alexandra(dot)wang(dot)oss(at)gmail(dot)com>, jian he <jian(dot)universality(at)gmail(dot)com>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org, Andrei Lepikhov <lepihov(at)gmail(dot)com>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, hs(at)cybertec(dot)at, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject:	Re: Is there value in having optimizer stats for joins/foreignkeys?
Date:	2026-06-22 19:03:33
Message-ID:	a6b82427-9806-496a-9e6a-4ac7edaac6f2@vondra.me
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi Alexandra,

I finally had a bit of time to take a look at this again today. AFAIK
nothing was posted since pgconf.dev last month, so I took a look at the
v7 to do at least something. I assume you may have an updated patch, but
chances are it's not too different.

First, it's great you keep working on this. I really hope we can get
something done for PG 20, I'll try to help with that.

Here's a couple review comments, in no particular order - it's simply in
the order 'git difftool' presented me the code in. Also, some of this is
definitely nitpicky / cosmetic.

1) statsmds.c

Does it make sense to check list_length(stmt->relations) at all?
Consider this code:

/*
* Examine the FROM clause. We support either: 1. Single RangeVar for
* single-table statistics 2. JoinExpr for join statistics
*/
if (list_length(stmt->relations) != 1)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("only a single relation or JOIN is ...")));

I guess when I wrote that code, I imagined we'd get either a list with a
single single rel (for single-rel stats), or a list of rels (for a
join). But clearly, that does not happen. We seem to get a single-item
list with a RangeVar or JoinExpr, right? So maybe the check is wrong?

2) statsmds.c

Isn't this check overly strict? Shouldn't we complain only in ambiguous
cases, with the same column in multiple tables?

/* Join stats require table-qualified column names */
if (isjoin)
ereport(ERROR,
(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
errmsg("join statistics require table-qualified ...")));

3) statscmds.c

Hmmm, we're now using only MCVs for joins, so this restriction makes
sense. How temporary is this restriction? I was hoping we might use some
of the stats for other purposes (e.g. ndistinct would be useful for
estimating GROUP BY on top of a join - or is that impossible?).

Of course, we should not relax the restriction only when we actually
support those cases. Just a thought.

/*
* For join statistics, only MCV is currently supported.
*/
if (isjoin && (build_ndistinct || build_dependencies ||
build_expressions))
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("only MCV statistics are supported for join ...

4) statscmds.c

I'm a bit confused by the code in CreateStatistics, and how it copies
some of the code between the isjoin and !isjoin branches. But not all
code. For example in this block

else if (IsA(selem->expr, Var)) /* column reference in parens */

it has this

/* Treat virtual generated columns as expressions */
if (get_attgenerated(relid, var->varattno) ==
ATTRIBUTE_GENERATED_VIRTUAL)

but only in the !isjoin branch. What happens with generated columns in
the isjoin branch?

5) statscmds.c?

/* Join stats only support simple column references */
if (isjoin)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("expressions are not supported in join ...")));

Presumably that's just a temporary limitation, to keep the patch simple?
And it'll eventually go away, esp. if we end up merging the keys and
exprs into a single list, as discussed in Vancouver.

6) statscmds.c

I don't understand how the detection of duplicate columns (after this
comment) works for the join case, without doing any qsort (like the
single-table branch does).

* For join stats, preserve user-declared column order and check for
* duplicate (varno, attnum) pairs.

7) statscmds.c

I'm not sure I understand this bit:

* If there are no dependencies on a column, give the statistics object
* an auto dependency on the whole table. In most cases, this will be
* ...
* Join stats are excluded because they span two tables (no single
* "whole table" exists).

Why would there be no dependencies on a column - maybe when there are
just expressions? But join stats don't allow that right now, right? So
when would that happen? And if we allow expressions for join stats, what
would happen for joins stats? Or is it OK to just not have a dependency?

8) statsmds.c

In general, some of the code is getting a bit too hard to read, with the
next isjoin and !isjoin branches. I get how we got this code, but it'd
be nice to refactor / clean this up a bit in some way. Perhaps it'd help
if some of the code was moved to separate "helpers" functions, to keep
the "main" functions (like CreateStatistics) somewhat readable?

9) costsize.c

adds the include in the wrong place ;-)

10) clausesel.c

I'm a bit confused by this comment in clauselist_selectivity_ext:

* Note: FK-based joins are already handled by
* get_foreign_key_join_selectivity() which runs before
* clauselist_selectivity(). This primarily benefits:

Why does this matter here that FK joins were already handled?

Also, for the join stats we consider both the baserestrictinfos and the
join clause together, in case the MCV covers both (and there's some sort
of correlation). I suppose the FK joins don't consider that info, so
those estimates miss the information?

12) plancat.c

I'd ditch the "_list" suffixes from the local variables. I don't think
it helps, it just makes everything longer:

/* Join statistics fields */
List *joinrels_list = NIL;
List *keyattrs_list = NIL;
List *keyrefs_list = NIL;
List *joinconds = NIL;

13) random thought - How does this handle non-inner joins? Or does it
make sense only for inner joins?

14) extended_stats.c

I don't think StatExtEntry should have two ways to represent the same
information - I mean, we have a Bitmapset of columns for single-table
stats, but that does not work for join stats, so the patch adds a bunch
of new fields.

IMHO if the old way does not work for joins, maybe we should ditch it
entirely and use the new approach even for single-rel code. Probably
switch in a separate preparatory commit, before adding the join stats.

Also, I don't love having multiple "coordinated" lists - one for attnums
and one for varnos. Wouldn't it be better to have a single list, with
elements being a new struct with varno + varattnum fields (or whatever
else we need)? Maybe there even is a struct we could reuse?

15) ruleutils.c

I suspect this code in pg_get_statisticsobj_worker works right when the
relation name happens to be long (NAMEDATALEN-1 characters). Won't that
truncate the generated string in "buf"?

if (strcmp(all_aliases[i], all_aliases[j]) == 0)
{
char buf[NAMEDATALEN];

suffix++;
snprintf(buf, sizeof(buf), "%s_%d",
get_relation_name(all_reloids[i]), suffix);
all_aliases[i] = pstrdup(buf);

/* Restart scan to check for further conflicts */
j = -1;
}

That's all I have right now. I haven't reviewed the code in join_mcv.c
in detail yet. That's a rather dense code with not that many comments,
so I want to be 100% sure I'm not reviewing stale code.

regards

--
Tomas Vondra

In response to

Re: Is there value in having optimizer stats for joins/foreignkeys? at 2026-05-15 15:30:31 from Alexandra Wang

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Corey Huinker	2026-06-22 19:08:33	Re: use of SPI by postgresImportForeignStatistics
Previous Message	Zsolt Parragi	2026-06-22 18:56:37	Re: [PATCH] Add pg_get_table_ddl() to reconstruct CREATE TABLE statements