Expiring tuples from aggregation in window frames efficiently

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Expiring tuples from aggregation in window frames efficiently
Date: 2013-12-03 09:12:04
Message-ID: CAApHDvqu+yGW0vbPBb+yxHrPG5VcY_kiFYi8xmxFo8KYOczP3A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi folks,

I was casting my mind back to an idea that came up when windowing functions
were first implemented in 8.4 during some talk about how ROWS BETWEEN might
implemented in the future. This has now been implemented but with the
implementation there is some quadratic behaviour when tuples move off the
top of the window frame. In this case the aggregate value must be
re-calculated for all the remaining rows in the frame for every row that
exits the window's frame. e.g with SUM(value) OVER(ORDER BY id ROWS
BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) where the frame head is always
at the current row.

There is a comment in nodeWindowAgg.c on line 457 explaining that it may be
possible to optimise this by having functions which can "expire" tuples out
of the aggregate value. These are described as "negative transition
function".
Currently I'm looking over this to try to determine how hard this would be
to implement it, with views that if it is within my ability then I'd like
to have a patch ready for the next commitfest.

The purpose of this email is just to publish my rough implementation ideas
with the hope that someone can tell me if I'm on the right track or that
I'm off track somewhere. Also there may be optimisation opportunities that
I've not thought of.

The idea would be to allow an extra function type when a user does CREATE
AGGREGATE, this would be the function which removes a tuple from
aggregation, I'll call these "aggregate subtraction functions" from now on.

For SUM() I think the aggregation subtraction function would just do a -
instead of a +, and look pretty much the same as intN_sum(). It looks like
it would be similar for AVG() where we'd just remove from the count and the
sum. COUNT(col) and COUNT(*) I would imagine would be quite similar though
from what I've manage to figure out so far the implementation of this is a
bit special.
The specification of this aggregate subtraction function would be optional
in CREATE AGGREGATE and if it was present it would be used instead of
looping over all the rows in the frame each time the head of the window
frame moves to the next row.

I'm thinking that it would be useful to make it so these aggregate
subtraction functions returned true if they've managed to remove the tuple
and false to tell the executor to fall back and loop over each tuple in the
frame. I feel this would be useful as it would allow MAX() and MIN() to at
least partially get on-board with this optimisation.... Let me explain:

If the current MAX() is 10 and we ask the aggregate subtraction function to
subtract a tuple with 9, then the MAX() is still 10. So, in this case we
could just compare 9 with 10 and see that it is lower and then just return
true, thus pretending we actually did the subtraction, but the fact being
no subtraction was required. The same applies to MIN, though of course with
the comparison around the other way. It may also be possible to store a
counter which we could increase each time MAX() receives a tuple with the
current maximum value, this would be set back to 1 when the MAX receives a
higher value. The aggregate subtraction function could decrement that
counter each time it is asked to subtract the current maximum value and
only return false when that counter gets to 0. Though I think this might
mean that MAX() could take a small performance hit when it is used even not
as a windowing aggregate. The pseudo logic would look something along the
lines of:

IN > "1" MAX() = "1", maxcount = 1
IN > "2" MAX() = "2", maxcount = 1 (max increased, reset maxcount to 1)
IN > "2" MAX() = "2", maxcount = 2
OUT > "1" MAX() = "2", maxcount = 2 (value is less than current MAX, so
don't touch the maxcount)
OUT > "2" MAX() = "2", maxcount = 1
OUT > "2" MAX() = ???, maxcount = 0 (maxcount is 0 so return false as we
need to recalculate)

Tonight I'm trying to figure out how all of this could be implemented, I'll
list below my rough idea of how this might be done. Please shout out if
there is something out of order here.

1. Modifiy pg_proc.h adding a new bool flag for aggsubtract function.
2. In CREATE FUNCTION allow AGGSUBTRACT to be specified similar to how
WINDOW is, only demand that these functions return BOOLEAN.
3. Alter CREATE AGGREGATE to allow the AGGSUBTRACT function type, this will
be optional.
4. Write agg subtract functions for COUNT(col), COUNT(*), SUM(), AVG(),
MIN() and MAX()
5. Change eval_windowaggregates() to attempt to use the aggregate
subtraction function if it exists for this aggregate and if that function
returns true use the updated aggregate state.

I think perhaps as a simple initial implementation I could skip steps 2 and
3 and for step 4 just implement int4_sum_neg(), though I'm not quite sure
yet if these steps would be required for initdb to work.

Also the use passing volatile functions to the aggregate requires some
thought. I would imagine the optimisation needs to be disabled in this
case, but I'm not quite sure actually how the code should behave in this
case. I took a quick look at the results of:

CREATE SEQUENCE test_seq;
SELECT currval('test_seq'),
COUNT(*) OVER (ORDER BY x.x ROWS BETWEEN CURRENT ROW AND UNBOUNDED
FOLLOWING),
SUM(nextval('test_seq')) OVER (ORDER BY x.x ROWS BETWEEN CURRENT ROW
AND UNBOUNDED FOLLOWING)
FROM generate_series(1,2) x (x);
DROP SEQUENCE test_seq;

The results are:
currval | count | sum
---------+-------+-----
2 | 2 | 3
3 | 1 | 3

I've not looked to see if the spec has anything about this but, the first
row will have sum as 1+2 then the 2nd row will just have 1 row to aggregate
and the value will be 3, I could see an argument that the results for this
should actually be:

currval | count | sum
---------+-------+-----
2 | 2 | 3
3 | 1 | 2

Though actually making the above the results would require materialising
values for each row instead of calling the volatile function again each
time... Perhaps volatile functions as arguments to aggregate functions when
in a window which the frame head is not fixed at the top of the partition
should have been disallowed in the first place.

If anyone has any time to point out if I'm thinking along the correct lines
for this implementation then that would be really helpful.

Regards

David Rowley

Browse pgsql-hackers by date

  From Date Subject
Next Message Etsuro Fujita 2013-12-03 09:38:28 Re: Get more from indices.
Previous Message Sawada Masahiko 2013-12-03 08:34:14 Re: Logging WAL when updating hintbit