Re: Window functions patch v04 for the September commit fest

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Window functions patch v04 for the September commit fest
Date: 2008-09-01 18:00:47
Message-ID: 48BC2DCF.2090103@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hitoshi Harada wrote:
> 2008/8/30 Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>:
>> Here's the latest window functions patch against HEAD. It seems to be
>> ready for the September commit fest, as added documents, WINDOW clause
>> feature and misc tests. I guess this would be the window functions
>> feature freeze for 8.4. The remaining feature will be implemented for
>> the later release.
>>
>> This patch consists of:
>> - windowed aggregates
>> - cooperate with GROUP BY aggregates
>> - some optimizations with multiple windows
>> - ranking functions
>> - WINDOW clause
>> - windowing SQL regression tests
>> - sgml documents
>>
>> Looking up the total road map, the dropped features are:
>>
>> - sliding window (window framing)
>> - lead(), lag(), etc. that reach for random rows
>> - user defined window functions

Ok, I'm starting to read up on SQL2003 window functions, and to review
this patch. Here's a long brain-dump of my first thoughts on the design.

Let's list a couple of requirements first:

1. It's important that what gets committed now can be extended to handle
all of the window function stuff in SQL2003 in the future, as well as
user-defined-window-functions in the spirit of PostgreSQL extensibility.
Even if we don't implement all of it in this release.

2. Performance needs to be reasonable, in the face a large number of
input rows. These are typically used in OLAP queries that process
gazillions of rows. Some window functions need access to all rows in the
window frame or partition, so you can't reasonably expect great
performance with those if the working set is large, but those that
don't, should work with finite memory requirements, and reasonably fast.

Because of 1., let's focus on the interface for
user-defined-window-functions first. Ideally, the interface is such that
all the window functions and aggregates defined in SQL2003, and others
we might want to have in core or as pgfoundry projects, can be
implemented using it.

I think there's still some confusion in the design document about what a
frame is. The terminology section is great, it helped me a lot to get
started, so thank you for that. However, in the "Actual design in the
patch" you describe how a window function works, and you talk about a
window frame, but elsewhere in the doc you note that window frames are
not implemented yet.

As already discussed, there's two different kind of window expressions:
window aggregates, and ranking aggregates (you call them window
functions in your doc). It seems that they are different enough that we
can't bang them together. For example, you can't use a window frame
clause with ranking aggregates, and ranking aggregates need to see the
whole row, whereas window aggregates only see the expressions passed as
argument.

API for window functions (aka. ranking aggregate functions)
------------------------

Borrowing ideas from the many approaches listed in the design doc, I
propose an interface using a set-returning function. Similar to Simon's
proposal, but taking into account that a ranking function might need to
see some or all input tuples before returning anything.

For each input tuple, the window function can return any number of
tuples, including 0. After processing all input values, though, the
total number of output rows returned must match the total number of
input values fed into it. The function must eventually output one value
for each input value, in the same order.

This is a flexible design that allows for different kind of
implementations; the function can return one output tuple for each input
tuple (ROW_NUMBER(), RANK() etc.), or it can swallow all input tuples
first, and only then return all output tuples (e.g. CUME_DIST).

Here's a couple of step-by-step examples of how some functions would
work. Imagine input values 1, 3, 4, 4, 5:

RANK
----

Invocation Input Value returned Internal state (after)
1 1 1 (last value 1, count 1, rank=1)
2 3 2 (last value 2, count 1, rank=2)
3 4 3 (last value 4, count 1, rank=3)
4 4 3 (last value 4, count 2, rank=3)
5 5 5 (last value 5, count 1, rank=5)

So RANK returns one tuple after each input tuple. Just like in your
current patch, keep the last value returned, a row count, and the
current rank as state.

LAG
---

Internal state is a FIFO queue of values. Each input value is pushed to
the FIFO, and the tuple that falls out of the queue is returned.

For example, LAG(<col>,2,0):

Invocation Input row Value returned Internal state (after)
1 1 0 (1)
2 3 0 (3, 1)
3 4 1 (4, 3)
4 4 3 (4, 4)
5 5 4 (5, 4)

LEAD
----

Return nothing for first <offset> input tuples. Then return the current
input tuple for the rest of the input tuples, and after the last input
tuple, return <offset> number of default values.

For example, LEAD(<col>,2,0):

Invocation Input row Value returned Internal state (after)
1 1 <none> (offsetbegin = 1, pad=2)
2 3 <none> (offsetbegin = 0, pad=2)
3 4 4 (offsetbegin = 0, pad=2)
4 4 4 (offsetbegin = 0, pad=2)
5 5 5 (offsetbegin = 0, pad=2)
6 <none> 0 (offsetbegin = 0, pad=1)
7 <none> 0 (offsetbegin = 0, pad=0)

For functions like LEAD or CUME_DIST that don't immediately start
returning values, the executor will need a tuplestore to buffer input
rows until it gets their values from the window function.

The syntax for creating a ranking aggregate might look something like this:

CREATE RANKING AGGREGATE name (
SFUNC = sfunc,
STYPE = state_data_type,
OUTFUNC = outfunc
)

This is very similar to Simon's proposal, and to current CREATE
AGGREGATE, but tweaked a bit to fix the problems Hitoshi pointed out.

sfunc is called once for each input row. Unlike with normal aggregates,
sfunc is passed the whole input row, so that e.g RANK can compare it
against the previous row, or LEAD can buffer it.

outfunc is a set-returning function, and it is called until it returns
no more rows, after each sfunc invocation.

Window aggregates
-----------------

Like normal aggregates, window aggregates process N input values, and
return one output value. In normal GROUP BY expressions, the aggregate
function is called once each group, but in a window aggregate expression
with a window frame (aka sliding window), there is as many "groups" as
there is rows.

In fact, any normal aggregate function can also be used as a window
aggregate and vice versa. The trivial implementation is to have a buffer
to hold all values currently in the window frame, and for each input
row, invoke the aggregate function over all the rows currently in the
frame. I think we'll need support that, to allow using any built-in or
user-defined aggregate as a window aggregate.

However, we also want to provide optimized versions of the common
aggregates. Aggregates like COUNT, SUM, and AVG can easily be
implemented more efficiently, without rescanning all the tuples in the
group.

I propose an extension to the current CREATE AGGREGATE syntax to allow
for optimized versions. Currently, the syntax is like:

CREATE AGGREGATE name ( input_data_type [ , ... ] ) (
SFUNC = sfunc,
STYPE = state_data_type
[ , FINALFUNC = ffunc ]
[ , INITCOND = initial_condition ]
[ , SORTOP = sort_operator ]
)

Let's add a function to allow removing tuples from the current window frame:

CREATE AGGREGATE name ( input_data_type [ , ... ] ) (
SFUNC = sfunc,
STYPE = state_data_type
[ , FINALFUNC = ffunc ]
[ , INITCOND = initial_condition ]
[ , SORTOP = sort_operator ]
[ , SUBTRACTFUNC = subfunc,
)

subfunc is like sfunc, except that the input value is "subtracted" from
the current value. On each input row, the executor calls sfunc for all
new tuples that enter the window frame, and subfunc for all tuples that
exit it. Another difference is that finalfunc can be called many times
during the lifecycle of an aggregate invocation. For example, the
subfunc for COUNT would decrement the row count by one, and for SUM,
subfunc would subtract the removed value from the current sum. Perhaps
we should also support an "opportunistic subfunc", that could either
perform the subtraction, or return a special value indicating that it
can't be done, and we need to calculate the aggregate from scratch as if
subfunc wasn't defined. That would be useful for MIN/MAX, which can be
implemented by just keeping track of the current MIN/MAX value, and
"subtracting" any other value than the current MIN/MAX is a no-op, but
"subtracting" the current MIN/MAX value requires scanning all the rest
of the tuples from scratch.

PS. Have you looked at the "hypothetical set functions" in SQL2003?

PPS. I was just about to send this, when I noticed that LEAD/LAG are in
fact not in the SQL2003 draft I have at hand. In fact, none of the
window functions defined there, RANK, DENSE_RANK, PERCENT_RANK,
CUME_DIST and ROW_NUMBER, require random access to the tuples. Still
seems like a good idea to be have the flexibility for them.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ryan Bradetich 2008-09-01 19:32:09 Re: [Patch Review] TRUNCATE Permission
Previous Message Simon Riggs 2008-09-01 16:55:14 Re: Window functions patch v04 for the September commit fest