| From: | Henson Choi <assam258(at)gmail(dot)com> |
|---|---|
| To: | Haibo Yan <tristan(dot)yim(at)gmail(dot)com> |
| Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: [PATCH] DISTINCT in plain aggregate window functions |
| Date: | 2026-06-26 01:03:43 |
| Message-ID: | CAAAe_zCfVbNrOUF5FaE38EcwYbtXacThR9xdVe2UJJPeDSL2KA@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi Haibo,
This is an initial, high-level pass over v2 (patches 1-3), not a detailed
code review -- I'll follow up with more as time permits. For this round I
wanted to confirm the shape of the feature and the cases it rejects, flag
where its behavior diverges from non-window DISTINCT, note test coverage,
and check one interaction. A short list of suggestions is at the end.
Overall the approach looks reasonable, and for the shapes I tried the
per-row values matched equivalent non-window formulations (correlated /
GROUP BY rewrites, and Oracle for the whole-partition forms). The points
below are mostly about matching our own non-window DISTINCT semantics, not
about the patch being incorrect.
1. Added surface
----------------
The series makes agg(DISTINCT x) OVER (...) accepted for plain aggregate
window functions (count/sum/avg/...), which PostgreSQL currently rejects.
DISTINCT is carried through parse and deparse so it survives view
definitions, and execution is enabled for a defined subset of frames.
2. Behavior by frame shape
--------------------------
Execution uses one of two strategies depending on the frame:
- Whole-partition frames (the frame is effectively the entire
partition): the distinct result is the same for every row, so it is
computed once and reused.
- Running / non-shrinking (grow-only) frames (start at UNBOUNDED
PRECEDING, no EXCLUDE): the distinct set only grows as rows enter, so it
is maintained incrementally per row.
3. Cases that are (intentionally) rejected
------------------------------------------
- DISTINCT on an aggregate that takes more than one argument (roadmap 7)
- frames that don't start at UNBOUNDED PRECEDING, or that use EXCLUDE
(roadmap 4-6)
- in the running case, an argument type that is not hashable
4. Behavior-level observations
------------------------------
Compared with GROUP BY agg(DISTINCT x), three things differ between the two
frame strategies:
- Type support: a type that is sortable but not hashable is accepted in
the whole-partition case yet rejected in the running case, whereas
GROUP BY DISTINCT accepts it (e.g. count(DISTINCT x::money)).
- Value order: for order-sensitive aggregates (e.g. array_agg(DISTINCT
...)), the order in which distinct values are aggregated differs
between the two frame strategies, so the same expression can produce
differently-ordered results.
- Memory: the running (hash) path is the only one of these that holds
its distinct set in an unbounded in-memory structure; everything else,
including the patch's own whole-partition path, is work_mem-bounded and
spills to disk:
whole-partition (this patch) sort work_mem-bounded, spills
running (this patch) hash unbounded, no spill
GROUP BY agg(DISTINCT x) sort work_mem-bounded, spills
SELECT DISTINCT, HashAgg hash work_mem-bounded, spills
SELECT DISTINCT, Sort+Unique sort work_mem-bounded, spills
So a large, high-cardinality partition in the running case can grow the
hash table without bound -- whereas every other path here, including the
hash-based HashAgg, is work_mem-bounded and spills.
5. Test coverage (measured)
---------------------------
I ran line coverage on the changed executor code. The patch's added and
changed lines are about 77% covered (219/284); the file as a whole is
around 91%. The uncovered ~23% of the new lines is not spread out -- it
concentrates in a few behaviors the functional tests never exercise,
because those tests center on count/sum/avg over integer arguments:
- strict aggregates (max/min DISTINCT), i.e. the strict/NULL-seed path
- pass-by-reference argument and transition types (text/numeric)
- (plus a couple of effectively unreachable guards)
A few more gaps don't show up as uncovered lines: RANGE frames with peer
rows (tied ORDER BY keys), GROUPS mode, and order-sensitive aggregates
(per the value-order point above).
6. Interaction with row pattern recognition (RPR)
-------------------------------------------------
I applied this series on top of the RPR patch set and built the two
together. They combine cleanly -- no textual (rebase) conflict, and the
regression tests pass without any change to code or tests.
That they coexist is structural, not luck: RPR requires the frame to start
at CURRENT ROW, while this feature requires UNBOUNDED PRECEDING, so a single
window can satisfy at most one of them -- a combined query is always
rejected by one side or the other. As long as these two frame requirements
stay mutually exclusive, I would expect the two to merge without trouble.
One asymmetry worth noting: the RPR side's CURRENT-ROW requirement follows
from the standard's definition of row pattern recognition, whereas this
feature's UNBOUNDED-PRECEDING requirement is one the roadmap plans to relax
(the sliding-frame patches). So the open question is whether that
relaxation will ever reach frames that start at CURRENT ROW -- the exact
shape RPR requires, and the point at which the two features would meet. Is
CURRENT-ROW-start support part of the plan?
7. Suggestions
--------------
- Type support: the planner should decide sort vs hash from the argument
type, and the executor should just carry that out -- a non-hashable type
would then route to sort instead of being rejected at startup.
- Memory: the running hash seen-set grows without bound. It should spill
at work_mem the way HashAgg already does for grouping; hash can stay the
default for the common hashable case.
- Tests: fill the coverage gaps in section 5 -- strict aggregates,
pass-by-reference types, RANGE peers, GROUPS mode. I can provide cases
with cross-checked expected values.
- Documentation: there are no SGML changes for this new form. The
window-function-call synopsis doesn't show the DISTINCT form, and no
prose mentions it.
- Roadmap ordering (just a thought): a grow-only frame is the add-only
special case of the sliding machinery in patches 4-6, and the hash path
is where the section-4 divergences live -- so patch 3 reads more like an
optimization than a foundation.
Regards,
Henson
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Henson Choi | 2026-06-26 01:23:53 | Re: Row pattern recognition |
| Previous Message | Fujii Masao | 2026-06-26 00:51:16 | Re: enhance wraparound warnings |