Re: Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)
Date: 2013-07-18 19:40:04
Message-ID: 51E84494.2030508@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 07/17/2013 08:15 PM, Andrew Gierth wrote:
> The spec defines two types of aggregate function classed as "ordered set
> function", as follows:
>
> 1. An "inverse distribution function" taking one argument (which must be
> a grouped column or otherwise constant within groups) plus a sorted
> group with exactly one column:
>
> =# SELECT (func(p) WITHIN GROUP (ORDER BY q)) from ...
>
> The motivating example for this (and the only ones in the spec) are
> percentile_cont and percentile_disc, to return a percentile result
> from a continuous or discrete distribution. (Thus
> percentile_cont(0.5) within group (order by x) is the spec's version
> of a median(x) function.)

One question is how this relates to the existing

SELECT agg_func(x order by y)

... syntax. Clearly there's some extra functionality here, but the two
are very similar conceptually.

> 2. A "hypothetical set function" taking N arguments of arbitrary types
> (a la VARIADIC "any", rather than a fixed list) plus a sorted group
> with N columns of matching types:
>
> =# SELECT (func(p1,p2,...) WITHIN GROUP (ORDER BY q1,q2,...)) from ...
>
> (where typeof(p1)==typeof(q1) and so on, at least up to trivial
> conversions)
>
> The motivating example here is to be able to do rank(p1,p2,...) to
> return the rank that the specified values would have had if they were
> added to the group.

Wow, I can't possibly grasp the purpose of this. Maybe a practical example?

> We've also had an expression of interest in extending this to allow
> percentile_disc(float8[]) and percentile_cont(float8[]) returning
> arrays; e.g. percentile_cont(array[0, 0.25, 0.5, 0.75, 1]) to return an
> array containing the bounds, median and quartiles in one go. This is an
> extension to the spec but it seems sufficiently obviously useful to be
> worth supporting.

To be specific, I asked for this because it's already something I do
using PL/R, although in PL/R it's pretty much limited to floats.

Anyway, for anyone who isn't following why we want this: statitical
summary reports. For example, I'd love to be able to do a quartile
distribution of query execution times without resorting to R.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2013-07-18 19:40:24 Re: Improvement of checkpoint IO scheduler for stable transaction responses
Previous Message Fujii Masao 2013-07-18 19:26:05 Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: Proposal for Allow postgresql.conf values to be changed via SQL [review])