| From: | David Rowley <dgrowleyml(at)gmail(dot)com> |
|---|---|
| To: | Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com> |
| Cc: | PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Remove redundant DISTINCT when GROUP BY already guarantees uniqueness |
| Date: | 2026-06-12 03:19:07 |
| Message-ID: | CAApHDvqZ3qgcCL1RFrhfZ8qM3VNdSP3PDHAJbbzrLcKaLJCaDw@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Mon, 8 Jun 2026 at 23:12, Ilia Evdokimov
<ilya(dot)evdokimov(at)tantorlabs(dot)com> wrote:
> When a query contains both DISTINCT and GROUP BY, planner currently
> always creates a separate plan node to eliminate duplicate output rows.
> However, if GROUP BY is present, each output row already corresponds to
> exactly one group - meaning the rows produced by GROUP BY are already
> unique on their group keys. Any DISTINCT clause that covers all those
> group keys therefore adds no filtering and the de-duplication step is
> pure overhead.
I think only doing something just for that case is likely too narrow a scope.
In [1], I wrote:
> Sometime in the future, I'd like to see some sort of UniqueKeys List
> in RelOptInfo which is initially populated by using looking at the
> unique indexes during build_simple_rel(). The idea is that these
> UniqueKeys would get propagated into RELOPT_JOINRELs when the join
> condition matches a set of unique keys on the other side of the join.
> e.g. If the outer side of the join had a UniqueKey matching the join
> condition then we could take all of the UniqueKeys from the RelOptInfo
> for the inner side of the join (and vice versa). This infrastructure
> would allow us to no-op a DISTINCT when the RelOptInfo has targetlist
> items which completely cover at least one UniqueKey set, or not bother
> performing a sort for a GROUP BY when the GROUP BY clause is covered
> by a UniqueKey.
There was some work on that by Andy Fan after I wrote a prototype for
it. I didn't have much spare bandwidth to take a serious look at that
patch at the time. However, I still think it's the best way to solve
this as it allows the relevant UniqueKeys to be propagated up each
join level and we'll know exactly what the results are unique on at
each join level. Doing this would allow many more optimisations than
just skipping redundant DISTINCT columns or operations.
One example is: SELECT t1.id,sum(t1.col) FROM t1 INNER JOIN t2 ON
t1.t2_id = t2.unique_column GROUP BY t1.id; This would be able to do
a simple Group Aggregate without sorting as there's guaranteed to be 1
row per group.
It also allows the unique join optimisation to be based on the unique
keys rather than calculating that from looking at unique indexes or
the properties of a subquery.
I believe the latest work on UniqueKeys is in [2].
David
[1] https://postgr.es/m/CAKJS1f-uSUr9dw6j48S5g3zMS=w-vp2oCNZzfriW2yVLAXj9iw@mail.gmail.com
[2] https://postgr.es/m/flat/7mlamswjp81p.fsf%40e18c07352.et15sqa
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Ewan Young | 2026-06-12 03:36:03 | Fast-path FK checks reject valid inserts for domain-typed FK columns |
| Previous Message | Michael Paquier | 2026-06-12 03:18:10 | Re: [PATCH] Fix some typos in code comments |