| From: | Corey Huinker <corey(dot)huinker(at)gmail(dot)com> |
|---|---|
| To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Cc: | hs(at)cybertec(dot)at, Jeff Davis <pgsql(at)j-davis(dot)com> |
| Subject: | Is there value in having optimizer stats for joins/foreignkeys? |
| Date: | 2025-12-01 20:10:25 |
| Message-ID: | CADkLM=cUwMftPLFq0iD6-qKRyNiRM2HZGYVp6=0noxA8GfuEtA@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Threads like [1] and [2] have gotten me thinking that there may be some
value in storing statistics about joins.
For the sake of argument, assume a table t1 with a column t2id which
references the pk of table t2 that has columns t2.t2id, t2c1, t2c2, t2c3.
In such a situation I can envision the following statistics being collected:
* The % of values rows in t2 are referenced at least once in t1
* The attribute stats (i.e. pg_statistic stats) for t2c1, t2c2, t2c3, but
associated with t1 and weighted according to the frequency of that row
being referenced, which means that values of unreferenced rows are filtered
out entirely.
* That's about it for direct statistics, but I could see creating extended
statistics for correlations between a local column value and a remote
column, or expressions on the remote columns, etc.
The storage feels like it would be identical to pg_statistic but with a
"starefrelid" field that identifies the referencing table.
That much seems straightforward. A bigger problem is how we'd manage to
collect these statistics. We could (as Jeff Davis has suggested) keep our
tablesamples, but that wouldn't necessarily help in this case because the
rows referenced, and their relative weightings would change since the last
sampling. In a worst-case scenario, We would have to sample the joined-to
tables as well,and that's an additional burden on an already IO intensive
operation.
In theory, we could do some of this without any additional stats
collection. If the ndistinct of t1.t2id is, say, at least 75+% of the
ndistinct of t2.t2id, we could just peek at the attribute stats on t2 and
use them for estimates. However, that makes some assumptions that the stats
on t2 are approximately as fresh as the stats on t1, and I don't think that
will be the case most of the time.
CCing people who have wondered out loud about this topic within earshot of
me.
Thoughts?
[1]
https://www.postgresql.org/message-id/flat/6fdc4dc5-8881-4987-9858-a9b484953185%40joeconway.com#5a93cd7a730691843a7700c770397baf
[2]
https://www.postgresql.org/message-id/flat/tencent_3018762E7D4C9BC470C821C829C1BF2F650A%40qq.com
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Andres Freund | 2025-12-01 20:28:09 | Re: Buffer locking is special (hints, checksums, AIO writes) |
| Previous Message | Alexander Lakhin | 2025-12-01 20:00:01 | Re: Improving tracking/processing of buildfarm test failures |