Re: Document aggregate functions better w.r.t. ORDER BY

From: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Document aggregate functions better w.r.t. ORDER BY
Date: 2023-10-26 05:34:10
Message-ID: CAKFQuwadWwDfyA5O4v92CCvDb=hKCGY97pq8wrH=dXFX84WnCA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 25, 2023 at 7:13 PM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:

> On Thu, 26 Oct 2023 at 13:10, David G. Johnston
> <david(dot)g(dot)johnston(at)gmail(dot)com> wrote:
> > Question: Do you know whether we for certain always sort ascending here
> to compute the unique values or whether if, say, there is an index on the
> column in descending order (or ascending and traversed backwards) that the
> data within the aggregate could, with an order by, be returned in
> descending order?
>
> The way it's currently coded, we seem to always require ascending
> order. See addTargetToGroupList(). The call to
> get_sort_group_operators() only requests the ltOpr.
>
> A quick test creating an index on a column with DESC shows that we end
> up doing a backwards index scan so that we get the requested ascending
> order:
>
> create table b (b text);
> create index on b (b desc);
> explain select string_agg(distinct b,',') from b;
> QUERY PLAN
>
> ------------------------------------------------------------------------------------------
> Aggregate (cost=67.95..67.97 rows=1 width=32)
> -> Index Only Scan Backward using b_b_idx on b (cost=0.15..64.55
> rows=1360 width=32)
> (2 rows)
>
> However, I think we'd best stay clear of offering any guarantees in
> the documents about this. If we did that it would be much harder in
> the future if we wanted to implement the DISTINCT aggregates by
> hashing.
>
> So, I think we are mischaracterizing the Standard here, if only in the
specific case of array_agg.

SQL Standard: 4.16.4

Every unary aggregate function takes an arbitrary <value expression> as the
argument; most unary aggregate
functions can optionally be qualified with either DISTINCT or ALL.

If ARRAY_AGG is specified, then an array value with one element formed from
the <value expression>
evaluated for each row that qualifies.

Neither DISTINCT nor ALL are allowed to be specified for VAR_POP, VAR_SAMP,
STDDEV_POP, or
STDDEV_SAMP; redundant duplicates are not removed when computing these
functions.

10.9

<array aggregate function> ::=
ARRAY_AGG
<left paren> <value expression> [ ORDER BY <sort specification list> ]
<right paren>

I would reword the existing note to be something like:

The SQL Standard defines specific aggregates and their properties,
including which of DISTINCT and/or ORDER BY is allowed. Due to the
extensible nature of PostgreSQL it accepts either or both clauses for any
aggregate.

From the most recent patch:

<para>
- If <literal>DISTINCT</literal> is specified in addition to an
- <replaceable>order_by_clause</replaceable>, then all the
<literal>ORDER BY</literal>
- expressions must match regular arguments of the aggregate; that is,
- you cannot sort on an expression that is not included in the
- <literal>DISTINCT</literal> list.
+ If <literal>DISTINCT</literal> is specified with an
+ <replaceable>order_by_clause</replaceable>, <literal>ORDER
+ BY</literal> expressions can only reference columns in the
+ <literal>DISTINCT</literal> list. For example:
+<programlisting>
+WITH vals (v1, v2) AS ( VALUES (1,'Z'),(3,'D'),(4,'R'),(3,'A'),(2,'T') )
+SELECT array_agg(DISTINCT v2 ORDER BY v2 DESC) FROM vals;
+ array_agg
+-------------
+ {Z,T,R,D,A}
+</programlisting>

The change to a two-column vals was mostly to try and find corner-cases
that might need to be addressed. If we don't intend to show the error case
of DISTINCT v1 ORDER BY v2 then we should go back to the original example
and just add ORDER BY v DESC. I'm fine with not using string_agg here.

+ For example:
+<programlisting>
+WITH vals (v) AS ( VALUES (1),(3),(4),(3),(2) )
+SELECT array_agg(DISTINCT v ORDER BY v DESC) FROM vals;
+ array_agg
+-----------
+ {4,3,2,1}
+</programlisting>

We get enough complaints regarding "apparent ordering" that I would like to
add:

As a reminder, while some DISTINCT processing algorithms produce sorted
output as a side-effect, only by specifying ORDER BY is the output order
guaranteed.

David J.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2023-10-26 06:13:31 Re: Container Types
Previous Message Jeff Davis 2023-10-26 05:06:23 Re: Is this a problem in GenericXLogFinish()?