| From: | Richard Guo <guofenglinux(at)gmail(dot)com> |
|---|---|
| To: | Pg Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Propagate stadistinct through GROUP BY/DISTINCT in subqueries and CTEs |
| Date: | 2026-04-13 01:18:42 |
| Message-ID: | CAMbWs49rWYrecgreDhKsfx3VSDW=qo35s+iAmgGu=wpARrM8_g@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
I complained in [1] that some TPC-DS queries suffer from very poor
cardinality estimates on CTE scan filters, to the point that simply
disabling nestloop makes some queries run hundreds of times faster.
Here's a simple reproduction:
create table t (a int, b int, c int);
insert into t select i%2, i, i from generate_series(1,1000) i;
analyze t;
explain analyze
with cte as (select a, b, avg(c) as avg from t group by a, b)
select * from cte t1, cte t2
where t1.a = 1 and t2.a = 1 and t1.avg = t2.avg;
Column 'a' has only 2 distinct values, so the filter a=1 on the
1000-row CTE output should estimate ~500 rows (assuming these values
are equally common). Instead, the CTE scan estimates 5 rows (1000 *
1/200) because examine_simple_variable returns early when the subquery
has GROUP BY, and selectivity estimation falls back on
1/DEFAULT_NUM_DISTINCT.
-> CTE Scan on cte t1 (cost=0.00..22.50 rows=5 width=40)
(actual time=4.874..5.053 rows=500.00 loops=1)
As a result, this query ends up with a Nested Loop plan, and the
Execution Time is 192.907 ms.
For DISTINCT or GROUP BY key columns that are simple Vars, I think we
can propagate stadistinct from the base table, because the set of
distinct values is preserved after grouping. MCV frequencies,
histograms, and correlation data are not valid since GROUP BY and
DISTINCT change the frequency distribution, but with stadistinct
alone, callers like var_eq_const() can use a 1/ndistinct estimate
rather than 1/DEFAULT_NUM_DISTINCT.
Attached is a patch to do this. With the patch, the example above
estimates 500 rows ...
-> CTE Scan on cte t1 (cost=0.00..22.50 rows=500 width=40)
(actual time=3.785..4.143 rows=500.00 loops=1)
... and chooses a Hash Join, with an Execution Time of 8.238 ms (~20x
faster).
I tested this patch on TPC-DS query 31:
-- on master:
Planning Time: 5.207 ms
Execution Time: 1536140.258 ms
-- on patched:
Planning Time: 5.140 ms
Execution Time: 1149.482 ms
Over 1300x faster.
Does this approach make sense? Any thoughts?
[1] https://postgr.es/m/CAMbWs4-QU_nkFqFZLdzWRsEsVE8aLWx4qBBVq7g4rXw+cvYDMg@mail.gmail.com
- Richard
| Attachment | Content-Type | Size |
|---|---|---|
| v1-0001-Propagate-stadistinct-through-GROUP-BY-DISTINCT-i.patch | application/octet-stream | 13.1 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Shinoda, Noriyoshi (PSD Japan FSI) | 2026-04-13 01:28:55 | RE: [Proposal] Expose internal MultiXact member count function for efficient monitoring |
| Previous Message | David Rowley | 2026-04-13 01:18:11 | Re: StringInfo fixes, v19 edition. Plus a few oddities |