Re: Window functions patch v04 for the September commit fest

From: "Hitoshi Harada" <umi(dot)tanuki(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Subject: Re: Window functions patch v04 for the September commit fest
Date: 2008-09-02 04:46:46
Message-ID: e08cc0400809012146m5da0ecebqbba6d1ceecdf42f4@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for comments,

2008/9/2 Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>:
>
> Ok, I'm starting to read up on SQL2003 window functions, and to review this
> patch. Here's a long brain-dump of my first thoughts on the design.
>
> Let's list a couple of requirements first:
>
> 1. It's important that what gets committed now can be extended to handle all
> of the window function stuff in SQL2003 in the future, as well as
> user-defined-window-functions in the spirit of PostgreSQL extensibility.
> Even if we don't implement all of it in this release.
>
> 2. Performance needs to be reasonable, in the face a large number of input
> rows. These are typically used in OLAP queries that process gazillions of
> rows. Some window functions need access to all rows in the window frame or
> partition, so you can't reasonably expect great performance with those if
> the working set is large, but those that don't, should work with finite
> memory requirements, and reasonably fast.
>
> Because of 1., let's focus on the interface for
> user-defined-window-functions first. Ideally, the interface is such that all
> the window functions and aggregates defined in SQL2003, and others we might
> want to have in core or as pgfoundry projects, can be implemented using it.
>
> I think there's still some confusion in the design document about what a
> frame is. The terminology section is great, it helped me a lot to get
> started, so thank you for that. However, in the "Actual design in the patch"
> you describe how a window function works, and you talk about a window frame,
> but elsewhere in the doc you note that window frames are not implemented
> yet.

Sorry about that. In my understanding, the "Window Frame" is defined
by clauses such like "ROWS BETWEEN ... ", "RANGE BETWEEN ... " or so,
contrast to "Window Partition" defined by "PARTITION BY" clause. A
frame slides within a partition or there's only one frame if those
clauses are not specified. The current patch has not implemented this
yet. I'll update the docs.

>
> As already discussed, there's two different kind of window expressions:
> window aggregates, and ranking aggregates (you call them window functions in
> your doc). It seems that they are different enough that we can't bang them
> together. For example, you can't use a window frame clause with ranking
> aggregates, and ranking aggregates need to see the whole row, whereas window
> aggregates only see the expressions passed as argument.

I don't like to call the second type "ranking aggregates" because it
may refer to only ranking functions though there are more types of
function like ntile(), lead() and lag(). But "window functions"
doesn't seem appropriate either since it's ambiguous with the general
name of window expressions.

>
> API for window functions (aka. ranking aggregate functions)
> ------------------------
>
> Borrowing ideas from the many approaches listed in the design doc, I propose
> an interface using a set-returning function. Similar to Simon's proposal,
> but taking into account that a ranking function might need to see some or
> all input tuples before returning anything.
>
> For each input tuple, the window function can return any number of tuples,
> including 0. After processing all input values, though, the total number of
> output rows returned must match the total number of input values fed into
> it. The function must eventually output one value for each input value, in
> the same order.
>
> This is a flexible design that allows for different kind of implementations;
> the function can return one output tuple for each input tuple (ROW_NUMBER(),
> RANK() etc.), or it can swallow all input tuples first, and only then return
> all output tuples (e.g. CUME_DIST).
>
> Here's a couple of step-by-step examples of how some functions would work.
> Imagine input values 1, 3, 4, 4, 5:
>
> RANK
> ----
>
> Invocation Input Value returned Internal state (after)
> 1 1 1 (last value 1, count 1, rank=1)
> 2 3 2 (last value 2, count 1, rank=2)
> 3 4 3 (last value 4, count 1, rank=3)
> 4 4 3 (last value 4, count 2, rank=3)
> 5 5 5 (last value 5, count 1, rank=5)
>
> So RANK returns one tuple after each input tuple. Just like in your current
> patch, keep the last value returned, a row count, and the current rank as
> state.
>
> LAG
> ---
>
> Internal state is a FIFO queue of values. Each input value is pushed to the
> FIFO, and the tuple that falls out of the queue is returned.
>
> For example, LAG(<col>,2,0):
>
> Invocation Input row Value returned Internal state (after)
> 1 1 0 (1)
> 2 3 0 (3, 1)
> 3 4 1 (4, 3)
> 4 4 3 (4, 4)
> 5 5 4 (5, 4)
>
> LEAD
> ----
>
> Return nothing for first <offset> input tuples. Then return the current
> input tuple for the rest of the input tuples, and after the last input
> tuple, return <offset> number of default values.
>
> For example, LEAD(<col>,2,0):
>
> Invocation Input row Value returned Internal state (after)
> 1 1 <none> (offsetbegin = 1, pad=2)
> 2 3 <none> (offsetbegin = 0, pad=2)
> 3 4 4 (offsetbegin = 0, pad=2)
> 4 4 4 (offsetbegin = 0, pad=2)
> 5 5 5 (offsetbegin = 0, pad=2)
> 6 <none> 0 (offsetbegin = 0, pad=1)
> 7 <none> 0 (offsetbegin = 0, pad=0)
>
>
> For functions like LEAD or CUME_DIST that don't immediately start returning
> values, the executor will need a tuplestore to buffer input rows until it
> gets their values from the window function.
>
>
> The syntax for creating a ranking aggregate might look something like this:
>
> CREATE RANKING AGGREGATE name (
> SFUNC = sfunc,
> STYPE = state_data_type,
> OUTFUNC = outfunc
> )
>
> This is very similar to Simon's proposal, and to current CREATE AGGREGATE,
> but tweaked a bit to fix the problems Hitoshi pointed out.
>
> sfunc is called once for each input row. Unlike with normal aggregates,
> sfunc is passed the whole input row, so that e.g RANK can compare it against
> the previous row, or LEAD can buffer it.
>
> outfunc is a set-returning function, and it is called until it returns no
> more rows, after each sfunc invocation.

Your proposal is smarter than the current implementation. But it
doesn't seem complete in some way. From logical point of view, the
window functions should be allowed to access whole of rows in the
frame the current row belongs to (c.f. inverse distribution
functions). From implementing and performance point of view, there
need to consider such case like mixing window aggregates and ranking
aggregates in the same window, which means it is better that the two
types of functions are processed in the similar flow. Also logically,
SQL spec doesn't forbid ranking aggregates in sliding window frames.
So your design may be flexible enough to cover different requirements
of lead()/lag() but I don't know if it will.

> Window aggregates
> -----------------
>
> Like normal aggregates, window aggregates process N input values, and return
> one output value. In normal GROUP BY expressions, the aggregate function is
> called once each group, but in a window aggregate expression with a window
> frame (aka sliding window), there is as many "groups" as there is rows.
>
> In fact, any normal aggregate function can also be used as a window
> aggregate and vice versa. The trivial implementation is to have a buffer to
> hold all values currently in the window frame, and for each input row,
> invoke the aggregate function over all the rows currently in the frame. I
> think we'll need support that, to allow using any built-in or user-defined
> aggregate as a window aggregate.
>
> However, we also want to provide optimized versions of the common
> aggregates. Aggregates like COUNT, SUM, and AVG can easily be implemented
> more efficiently, without rescanning all the tuples in the group.
>
> I propose an extension to the current CREATE AGGREGATE syntax to allow for
> optimized versions. Currently, the syntax is like:
>
> CREATE AGGREGATE name ( input_data_type [ , ... ] ) (
> SFUNC = sfunc,
> STYPE = state_data_type
> [ , FINALFUNC = ffunc ]
> [ , INITCOND = initial_condition ]
> [ , SORTOP = sort_operator ]
> )
>
> Let's add a function to allow removing tuples from the current window frame:
>
> CREATE AGGREGATE name ( input_data_type [ , ... ] ) (
> SFUNC = sfunc,
> STYPE = state_data_type
> [ , FINALFUNC = ffunc ]
> [ , INITCOND = initial_condition ]
> [ , SORTOP = sort_operator ]
> [ , SUBTRACTFUNC = subfunc,
> )
>
> subfunc is like sfunc, except that the input value is "subtracted" from the
> current value. On each input row, the executor calls sfunc for all new
> tuples that enter the window frame, and subfunc for all tuples that exit it.
> Another difference is that finalfunc can be called many times during the
> lifecycle of an aggregate invocation. For example, the subfunc for COUNT
> would decrement the row count by one, and for SUM, subfunc would subtract
> the removed value from the current sum. Perhaps we should also support an
> "opportunistic subfunc", that could either perform the subtraction, or
> return a special value indicating that it can't be done, and we need to
> calculate the aggregate from scratch as if subfunc wasn't defined. That
> would be useful for MIN/MAX, which can be implemented by just keeping track
> of the current MIN/MAX value, and "subtracting" any other value than the
> current MIN/MAX is a no-op, but "subtracting" the current MIN/MAX value
> requires scanning all the rest of the tuples from scratch.

Hmmm, its name may be better as "invfunc" than subfunc. I feel
horrible in imagining to manage the combination of functions that
don't support subfunc, ones that does and ranking aggregates but it's
possible.

> PS. Have you looked at the "hypothetical set functions" in SQL2003?

I had a glance but not deeply, since I found I would be lost in design
if deeply diving into it. :)

--
Hitoshi Harada

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2008-09-02 04:57:22 Re: Mysterious Bus Error with get_fn_expr_argtype()
Previous Message Brendan Jurd 2008-09-02 04:35:55 Re: Mysterious Bus Error with get_fn_expr_argtype()