Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)

From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: David Fetter <david(at)fetter(dot)org>, Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Subject: Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)
Date: 2013-07-18 03:15:14
Message-ID: 2b8b55b8ba82f83ef4e6070b95fb0cd0@news-out.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

The spec defines two types of aggregate function classed as "ordered set
function", as follows:

1. An "inverse distribution function" taking one argument (which must be
a grouped column or otherwise constant within groups) plus a sorted
group with exactly one column:

=# SELECT (func(p) WITHIN GROUP (ORDER BY q)) from ...

The motivating example for this (and the only ones in the spec) are
percentile_cont and percentile_disc, to return a percentile result
from a continuous or discrete distribution. (Thus
percentile_cont(0.5) within group (order by x) is the spec's version
of a median(x) function.)

2. A "hypothetical set function" taking N arguments of arbitrary types
(a la VARIADIC "any", rather than a fixed list) plus a sorted group
with N columns of matching types:

=# SELECT (func(p1,p2,...) WITHIN GROUP (ORDER BY q1,q2,...)) from ...

(where typeof(p1)==typeof(q1) and so on, at least up to trivial
conversions)

The motivating example here is to be able to do rank(p1,p2,...) to
return the rank that the specified values would have had if they were
added to the group.

As usual, we do not want to constrain ourselves to supporting only the
specific cases in the spec, but would prefer a general solution.

We (meaning myself and Atri) have an implementation that basically
works, though it is not yet complete, but before taking it any further
we need to resolve the design question of how to represent these two
types of function in the system catalogs. The fact that there are in
effect two parts to the parameter list, which are either independent
(for inverse distribution funcs) or closely related (for hypothetical
set functions), doesn't seem to point to an obvious way to represent
this in pg_proc/pg_aggregate.

I'm not yet satisfied with the method used in our implementation, so
we're throwing this open for suggestions. We will post the
work-in-progress patch along with a description of its current
implementation shortly.

One of the major complications is that we ideally want to be able to do
polymorphism based on the type of the sorted group, specifically in
order to be able to do

percentile_disc(float8) within group (order by anyelement)

returning anyelement. (i.e. we should be able to get a discrete
percentile from any type that is orderable.) The question here is how to
resolve the return type both of the aggregate itself and of the finalfn.

We've also had an expression of interest in extending this to allow
percentile_disc(float8[]) and percentile_cont(float8[]) returning
arrays; e.g. percentile_cont(array[0, 0.25, 0.5, 0.75, 1]) to return an
array containing the bounds, median and quartiles in one go. This is an
extension to the spec but it seems sufficiently obviously useful to be
worth supporting.

Comments?

--
Andrew (irc:RhodiumToad)

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tatsuo Ishii 2013-07-18 03:34:08 Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement)
Previous Message Greg Smith 2013-07-18 02:58:16 Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement)