| From: | Tomas Vondra <tomas(at)vondra(dot)me> |
|---|---|
| To: | Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Cc: | hs(at)cybertec(dot)at, Jeff Davis <pgsql(at)j-davis(dot)com> |
| Subject: | Re: Is there value in having optimizer stats for joins/foreignkeys? |
| Date: | 2025-12-01 21:01:58 |
| Message-ID: | ea04ba03-50dd-4de8-a2bd-111f246a2f30@vondra.me |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On 12/1/25 21:10, Corey Huinker wrote:
> Threads like [1] and [2] have gotten me thinking that there may be some
> value in storing statistics about joins.
>
> For the sake of argument, assume a table t1 with a column t2id which
> references the pk of table t2 that has columns t2.t2id, t2c1, t2c2,
> t2c3. In such a situation I can envision the following statistics being
> collected:
>
> * The % of values rows in t2 are referenced at least once in t1
> * The attribute stats (i.e. pg_statistic stats) for t2c1, t2c2, t2c3,
> but associated with t1 and weighted according to the frequency of that
> row being referenced, which means that values of unreferenced rows are
> filtered out entirely.
> * That's about it for direct statistics, but I could see creating
> extended statistics for correlations between a local column value and a
> remote column, or expressions on the remote columns, etc.
>
Do I understand correctly you propose to collect such stats for every
foreign key? I recall something like that was proposed in the past, and
the argument against was that for many joins it'd be a waste because the
estimates are good enough. And for OLTP systems that's probably true.
Of course, it also depends on how expensive this would be. Maybe it's
cheap enough? No idea.
But I always assumed we'd have a way to explicitly enable such stats for
certain joins only, and the extended stats were designed to make that
possible.
FWIW I'm not entirely sure what stats you propose to collect exactly. I
mean, what does
... associated with t1 and weighted according to the frequency of
that row being referenced, which means that values of unreferenced
rows are filtered out entirely.
mean? Are you suggesting to "do the join" and build the regular stats as
if that was a regular table? I think that'd work, and it's mostly how I
envisioned to handle joins in extended stats, restricted to joins of two
relations.
> The storage feels like it would be identical to pg_statistic but with a
> "starefrelid" field that identifies the referencing table.
>
> That much seems straightforward. A bigger problem is how we'd manage to
> collect these statistics. We could (as Jeff Davis has suggested) keep
> our tablesamples, but that wouldn't necessarily help in this case
> because the rows referenced, and their relative weightings would change
> since the last sampling. In a worst-case scenario, We would have to
> sample the joined-to tables as well,and that's an additional burden on
> an already IO intensive operation.
>
Combining independent per-table samples does not work, unless the
samples are huge. There's a nice paper [1] on how to do index-based join
sampling efficiently.
> In theory, we could do some of this without any additional stats
> collection. If the ndistinct of t1.t2id is, say, at least 75+% of the
> ndistinct of t2.t2id, we could just peek at the attribute stats on t2
> and use them for estimates. However, that makes some assumptions that
> the stats on t2 are approximately as fresh as the stats on t1, and I
> don't think that will be the case most of the time.
>
> CCing people who have wondered out loud about this topic within earshot
> of me.
>
> Thoughts?
I think adding joins to extended stats would not be all that hard
(famous last words, I know). For me the main challenge was figuring out
how to store the join definition in the catalog, I always procrastinated
and never gave that a serious try.
FWIW I think we might start by actually using per-table extended stats
on the joined tables. Just like we combine the scalar MCVs on joined
columns, we could combine multicolumn MVCs.
regards
[1] https://www.cidrdb.org/cidr2017/papers/p9-leis-cidr17.pdf
--
Tomas Vondra
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Heikki Linnakangas | 2025-12-01 21:20:16 | Re: Bug in amcheck? |
| Previous Message | Melanie Plageman | 2025-12-01 20:41:07 | Re: Buffer locking is special (hints, checksums, AIO writes) |