Re: Window functions patch v04 for the September commit fest

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Window functions patch v04 for the September commit fest
Date: 2008-09-02 12:51:09
Message-ID: 48BD36BD.2070209@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hitoshi Harada wrote:
> 2008/9/2 Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>:
> In my understanding, the "Window Frame" is defined
> by clauses such like "ROWS BETWEEN ... ", "RANGE BETWEEN ... " or so,
> contrast to "Window Partition" defined by "PARTITION BY" clause. A
> frame slides within a partition or there's only one frame if those
> clauses are not specified. The current patch has not implemented this
> yet. I'll update the docs.

Yes, that's how I read it as well. Another way to think of it is that
there's always a window frame, but if no ROWS BETWEEN or similar clause
is given, the window frame spans the whole partition (or from the 1st
row of the partition, up to the current row, if ORDER BY is given).

> I don't like to call the second type "ranking aggregates" because it
> may refer to only ranking functions though there are more types of
> function like ntile(), lead() and lag(). But "window functions"
> doesn't seem appropriate either since it's ambiguous with the general
> name of window expressions.

Yep, confusing :-(. The SQL standard says that "a window function is one
of: a rank function, a distribution function, a row number function, a
window aggregate function, the ntile function, the lead function, the
lag function, the first-value function, the last-value function, the
nth-value function". So, window aggregate functions are a special class
of window functions, and there's no term to refer to all the rest of the
window functions excluding window aggregate functions.

Your doc SQL spec
Window expression Window function
Window function Any window function other than a window aggregate function
Window aggregate Window aggregate function

I tried to coin the term "ranking aggregate" for the SQL2008 term "Any
window function other than a window aggregate function", but you're
right that that's still confusing, because the SQL2008 term "rank
function" includes only RANK() and DENSE_RANK().

The spec calls them "group aggregate functions", when they're used with
GROUP BY, rather than as a window function. I think we could use that term.

> Your proposal is smarter than the current implementation. But it
> doesn't seem complete in some way. From logical point of view, the
> window functions should be allowed to access whole of rows in the
> frame the current row belongs to (c.f. inverse distribution
> functions).

By the whole of rows, do you mean
a) the chosen value or expression of all rows, or
b) all columns of the input rows?

Different window functions have different needs. RANK() for example does
need to see all columns, to compare them, but it only needs access to
the previous and current row. CUME_DIST on the other hand needs access
to all columns of all rows, and LAG needs access to a specific column of
a fixed number of rows. And row_number needs nothing.

The needs of access to the rows are so different that it seems best to
me to delegate the buffering to the window function.

Actually, looking closer to the ranking functions, they don't really
need access to all columns. They just need to be able to compare them,
according to the window ordering, so we should probably only provide
access to the arguments of the aggregate, evaluated for any row in the
frame, and a comparator function, that can determine if any given rows
in the frame are <, = or >.

> From implementing and performance point of view, there
> need to consider such case like mixing window aggregates and ranking
> aggregates in the same window, which means it is better that the two
> types of functions are processed in the similar flow. Also logically,
> SQL spec doesn't forbid ranking aggregates in sliding window frames.

It doesn't? Oh, bummer. It seems we need a grand unified theory of
ranking and other aggregates then :-(. How does something like RANK work
with a window frame? What does it return, if the current row is excluded
from the window frame, e.g with EXCLUDE CURRENT ROW?

Let's look at the trivial, generic, and slow implementation first, and
then figure out what tricks we can do to make it faster. I gather that
that generic algorithm, following how the SQL spec defines the window
frame, looks like this:

0. Construct the ordered set of tuples, containing all the tuples in the
partition.
For each row, start with the ordered set constructed in step 0, and:
1. Remove all tuples from the set that are before the start of the
sliding frame
2. Remove all tuples from the set that are after the end of the sliding
frame
3. Remove all tuples specified by the window-frame-exclusion clause
(EXCLUDE CURRENT ROW/GROUP/TIES/NO OTHERS).
4. Run the aggregate over all tuples still in the set.

Now, the first obvious optimization is that we don't want to construct
the window frame from scratch for each row. Instead:

0. Start with an empty ordered set S.
For each row:
1. Read forward until we hit the end of the new sliding frame, and add
those rows to S.
2. Remove all tuples from S that are before the start of the sliding frame.
3. Construct a new set T, which is the same as S, but all tuples
specified by the window-frame-exclusion clause (EXCLUDE CURRENT
ROW/GROUP/TIES/NO OTHERS) are removed.
5. Run the aggregate over all tuples in T.

This is applicable to all ranking and window aggregates. Per spec, if no
window-frame-clause is given, the start of the frame is always the 1st
row in the partition. If an ORDER BY is given, the end of the frame is
the the current row. Otherwise it's the end of partition, which means
that we can just run the aggregate once and return the same value for
all rows in the partition.

Now, let's consider how to speed up step 5. There's the difference
between ranking and window aggregates. Window aggregates don't care
about the position of the current row within the window frame, so the
interface I proposed earlier, with the subfunc, would work fine. But for
ranking aggregates, the position of the current row matters.

Another must-have optimization is that we must be able to eagerly
discard rows that we know won't be accessed anymore. And likewise, we
mustn't read ahead many rows that we don't need to calculate the
aggregate for the current row.

I think I'm coming around to your suggestion of passing a Window object
to the aggregate function, allowing random access to the current frame.
As Simon pointed out, there needs to be a way for the aggregate function
to signal when it doesn't need some rows in the frame anymore, for
performance.

The sliding window frame always moves forward, never backwards.
Therefore it's sufficient if the user-defined function can indicate a
cutoff point, allowing the executor node to discard any preceding rows.
However, the window-frame-exclusion clause complicates that, because the
current row and its peers might or might not be in the frame, which
means that rows can disappear and reappear into the middle of the window
frame. We don't need to implement window-frame-exclusion in this
release, but we should consider how that affects the API.

The API could be like this (in pseudo-syntax):

CREATE AGGREGATE name (
<type> evalfunc(Window object)
enterfunc(Window object, int pos)
exitfunc(Window object, int pos)
)

where evalfunc, enterfunc and exitfunc are functions provided by the
aggregate function author.

Window object {
/* Returns the position of current row within the frame. */
int get_current_pos();
/* Returns the number of rows in the frame. */
int get_max_pos();
/* Compares two tuple in the window frame, according to window
ordering */
int compare(int posa, int posb)
/* Returns the value of an aggregate argument for tuple at given
position */
Datum get_arg_value(int pos, int argno);
/* Promise that we won't call get_arg_value or compare for any pos <
cutoff_pos anymore */
void signal_cutoff(int cutoff_pos);
}

The executor calls enterfunc for each row that enters the window frame,
and exitfunc for each row that exits the frame (before removing the
tuple from the frame, so that it's still accessible to get_arg_value()
and compare()). After calling enter/exitfuncs for all rows that
enter/exit the frame, evalfunc is called once, to return the value for
the current row.

I believe this allows for an efficient implementation of all the window
functions we've discussed. It works fine for implementing regular
aggregates as well, so there no need to treat ranking and other
aggregates any differently. We can completely replace the current method
of defining aggregates with this if we want to, as long as we provide
some sort of glue for backwards-compatibility with aggregates in
external projects.

Here's sketches of some aggregate implementations using this API:

RANK()
------
enterfunc:
if position of the new row is > current row, do nothing. Otherwise,
decrement current rank.
exitfunc:
if position of the new row is > current row, do nothing. Otherwise,
increment current rank.

LEAD(offset)
------------
evalfunc:
Call get_row(get_current_row() + offset).
Call signal_cutoff(get_current_row());

LAG(offset)
-----------
evalfunc:
Call get_row(get_current_row() - offset).
Call signal_cutoff(get_current_row() - offset);

MIN/MAX
-------
enterfunc:
if new value is </> the current min/max value, replace it with the new
value
exitfunc:
if removed value = the current min/max value, rescan the current frame
for new min/max value, iterating through all values (1..get_max_row()).

COUNT
-----
enterfunc: increment counter
exitfunc: decrement counter

CUME_DIST
---------
CUME_DIST is defined as NP/NR, where NP is the number of rows preceding
or peer with the current row, and NR is the total number of rows in the
frame. NR = get_max_pos(). Maintain NP-counter in enterfunc/exitfunc,
incrementing it whenever a row preceding current row enters or exits the
frame. In evalfunc, increment the counter to reflect the new current row
position.

> I feel horrible in imagining to manage the combination of functions that
> don't support subfunc, ones that does and ranking aggregates but it's
> possible.

Well, if you handle each window/ranking aggregate in a separate Window
node, like you did, it should be much easier to manage. If it gets
overwhelmingly complex, you could also split it into two slightly
different Window nodes, one for the generic case, for aggregates that
don't have subfunc, and one for ones that do.

(the point is moot, if we go with the API I proposed above, using a
Window object)

All in all, this is such a monstrous piece of functionality, that if we
try to implement everything in one patch, it'll never get finished. So
we clearly have to implement this in phases, and just make sure we don't
paint ourselves in the corner with the design.

I'd suggest:

1. Implement Window node, with the capability to invoke an aggregate
function, using the above API. Implement required parser/planner
changes. Implement a few simple ranking aggregates using the API.
2. Implement glue to call normal aggregates through the new API
3. Implement the optimization to drop rows that are no longer needed
(signal_cutoff can be a no-op until this phase)
4. Implement window framing (the frame can always be all rows in the
partition, or all rows until the current row, until this phase)
5. Expose the new API to user-defined aggregates. It can be an internal
API only used by built-in functions until this phase

I believe you already have phase 1 in your patch, except for the API
changes.

>> PS. Have you looked at the "hypothetical set functions" in SQL2003?
>
> I had a glance but not deeply, since I found I would be lost in design
> if deeply diving into it. :)

Yeah ;-). A quick glance suggests that it won't affect aggregate the API
we come up with, but will require changes to parser/executor.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2008-09-02 13:09:37 Re: Window functions patch v04 for the September commit fest
Previous Message Zdenek Kotala 2008-09-02 12:41:27 Page layout footprint