Proposal for aggregate-function cleanups in 7.1

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org, pgsql-general(at)postgreSQL(dot)org
Subject: Proposal for aggregate-function cleanups in 7.1
Date: 2000-07-12 22:53:47
Message-ID: 25147.963442427@sss.pgh.pa.us

There are quite a number of misfeatures in Postgres' current support for
aggregate functions (COUNT, SUM, etc.). We've had discussion threads about
them before; see pghackers around 1/23/00 and 6/16/99 for example. I
propose to fix a few things for 7.1:

* Hard-wired boundary condition handling in ExecAgg prevents
user-defined aggregates from determining their treatment of nulls,
no input rows, etc.
* Some SUM and AVG aggregates are vulnerable to overflow due to poor
choice of datatypes for accumulators.
* Dual-transition-function design is bizarre and unnecessary.

The last of these might break user-defined aggregates, if there are
any out there that use two transition functions, so I'm soliciting
comments first.

Currently, ExecAgg evaluates each aggregate in the following steps
(initcond1 and initcond2 are the initial state values, sfunc1 and sfunc2
are the transition functions, and finalfunc is the final function):

    value1 = initcond1
    value2 = initcond2
    foreach input_value do
        value1 = sfunc1(value1, input_value)
        value2 = sfunc2(value2)
    value1 = finalfunc(value1, value2)

(It's possible to omit one of sfunc1/sfunc2, and then optionally omit
finalfunc as well, but that doesn't affect things materially other than
creating a bunch of confusing rules about which combinations are legal.)
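
To make that concrete, today's integer AVG() instantiates the template
roughly as follows, using the plain addition, increment, and divide
functions mentioned further down (a sketch of the data flow only, not the
exact catalog entries):

    value1 = initcond1                        -- running sum
    value2 = initcond2                        -- running count
    foreach input_value do
        value1 = int4pl(value1, input_value)  -- sum := sum + input
        value2 = int4inc(value2)              -- count := count + 1
    value1 = int4div(value1, value2)          -- result := sum / count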

The following behaviors are currently hard-wired in ExecAgg and can't
be changed by the transition functions:
* If initcond1 is NULL then the first non-NULL input_value is assigned
directly to value1. sfunc1 isn't applied until value1 is non-NULL.
(This behavior is useful for aggregates like min() and max() where
you really want value1 to just be the least or greatest value so far.)
* When the current tuple's input_value is NULL, sfunc1 and sfunc2 are
not applied; instead the previous value1 and value2 are kept.
* If value1 is still NULL at the end (ie, there were no non-null input
values), then finalfunc is not invoked; instead a NULL result is
forced.
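
For example, running the built-in max() over the inputs (NULL, 3, 7): the
NULL is skipped entirely, the 3 is copied straight into value1 because
initcond1 is NULL, and only the 7 actually goes through sfunc1, giving
max(3, 7) = 7. Over nothing but NULLs, value1 stays NULL and the result
is forced to NULL.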

These behaviors were necessary back when we didn't have a reasonable
design for NULL handling in the function manager, but now that we do
it is possible to allow the transition functions to control behavior
for NULL inputs and zero input rows, instead of hard-wiring the results.
I propose the following definition instead:

1. If sfunc1 is not marked "strict" in pg_proc, then it is invoked for
every input tuple and must do all the right things for itself.
ExecAgg will not protect it against a NULL value1 or NULL input.
An sfunc1 coded in this style might use value1 = NULL as a flag
for "no input data yet". It is up to the function whether and
how to change the state value when the input_value is NULL.

2. If sfunc1 is marked "strict" then it will never be called with
NULL inputs. Therefore, ExecAgg will automatically skip tuples
with NULL input_values, preserving the previous value1/value2.
Also, if value1 is initially NULL, then the first non-NULL
input_value will automatically be inserted into value1; sfunc1
will only be called beginning with the second non-NULL input.
(This is essentially the same as current behavior.)

3. If finalfunc is not marked "strict" then it will be called in
any case, even if value1 is NULL from its initial setting or
as a result of sfunc1 having returned NULL. It is up to the
finalfunc to return an appropriate value or NULL.

4. If finalfunc is marked "strict" then it cannot be called with
NULL inputs, so a NULL result will be substituted if value1
or value2 is still NULL. (Same as current behavior.)
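
To illustrate the difference between points 1 and 2, here is roughly what
a user-defined aggregate that wants to see every row, NULL or not, could
look like under this scheme. The names are invented for the example, the
SQL syntax is purely illustrative, and it assumes the single
transition-function form argued for below:

    -- Counts every input row, NULLs included, so the transition function
    -- is deliberately NOT strict (point 1): ExecAgg calls it for NULL
    -- inputs too, and it simply ignores the value.
    CREATE FUNCTION count_any_step(int8, int4) RETURNS int8
        AS 'SELECT $1 + 1'
        LANGUAGE sql;

    CREATE AGGREGATE count_any (
        basetype = int4,
        sfunc    = count_any_step,
        stype    = int8,
        initcond = '0'
    );

Marking count_any_step strict instead would turn this into a count of
non-NULL inputs without touching the function body: ExecAgg would skip
the NULL rows for it, as in point 2.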

sfunc2 is a somewhat peculiar appendage in this scheme. To do AVG()
correctly, sfunc2 should only be called for the same input tuples that
sfunc1 is called for (ie, not for NULLs). This implies that a "strict"
sfunc2 must not be called when input_value is NULL, even though
input_value is not actually being passed to sfunc2! This seems like a
pretty weird abuse of the notion of function strictness. It's also quite
unclear whether it makes any sense to have sfunc1 and sfunc2 with
different strictness settings.

What I would *like* to do about this is eliminate value2 and sfunc2 from
the picture completely, leaving us with a clean single-transition-value
design for aggregate handling. That would mean that AVG() would need to
be redone with a specialized transition datatype (holding both the sum and
count) and specialized transition function and final function, rather than
using addition/increment/divide as it does now. Given the AVG fixes I
intend to do anyway, this isn't significant extra work as far as the
standard distribution is concerned. But it would break any existing
user-defined aggregates that may be out there, if they use two transition
functions. Is there anyone who has such functions and wants to object?
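
To give an idea of the shape I have in mind, a single-transition-value
AVG() might look roughly like this sketch, which uses a two-element array
as the "sum plus count" transition datatype. All the names are invented
for illustration, and the real thing would use dedicated C functions, but
it shows how the strictness rules above carry the NULL handling:

    -- State is {count, sum}.  Strict, so per rule 2 NULL inputs are
    -- skipped automatically and never touch the state.
    CREATE FUNCTION avg_step(numeric[], numeric) RETURNS numeric[]
        AS 'SELECT ARRAY[$1[1] + 1, $1[2] + $2]'
        LANGUAGE sql STRICT;

    -- Not strict: it must produce NULL itself when the count is still
    -- zero, ie no non-null inputs were seen.
    CREATE FUNCTION avg_final(numeric[]) RETURNS numeric
        AS 'SELECT CASE WHEN $1[1] = 0 THEN NULL ELSE $1[2] / $1[1] END'
        LANGUAGE sql;

    CREATE AGGREGATE my_avg (
        basetype  = numeric,
        sfunc     = avg_step,
        stype     = numeric[],
        initcond  = '{0,0}',
        finalfunc = avg_final
    );

With zero input rows, or nothing but NULLs, the state never moves off
{0,0} and avg_final returns NULL, as the spec requires; and the division
happens in numeric, so the fractional part isn't lost.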

As to the other point: currently, the SUM and AVG aggregates use an
accumulator of the same datatype as the input data. This is almost
guaranteed to cause overflow if the input column is int2, and there's
a nontrivial risk for int4 as well. It also seems bizarre to me that
AVG on an integer column delivers an integral result --- it should
yield a float or numeric result so as not to lose the fractional part.
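
(To put numbers on it: int2 tops out at 32767, so an int2 accumulator can
overflow after summing just two ordinary values; int4 tops out at
2147483647, so summing an int4 column whose values are around a million
overflows after only a couple of thousand rows.)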

What I propose to do is reimplement SUM and AVG so that there are
basically only two underlying implementations: one with a "float8"
accumulator and one with a "numeric" accumulator. The float8 style
would be used for either float4 or float8 input and would deliver a
float8 result. The numeric style would be used for all numeric and
integral input types and would yield a numeric result of appropriate
precision.
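
As a sketch of the integer-input case (names invented, syntax purely
illustrative; the real accumulator functions would be written in C):

    -- int4 inputs accumulate into a numeric running total, which cannot
    -- overflow no matter how many rows are summed.
    CREATE FUNCTION sum_step(numeric, int4) RETURNS numeric
        AS 'SELECT $1 + $2'
        LANGUAGE sql STRICT;

    CREATE AGGREGATE my_sum (
        basetype = int4,
        sfunc    = sum_step,
        stype    = numeric,
        initcond = '0'
    );

(A production SUM() also has to return NULL rather than 0 when there are
no input rows; this sketch glosses over that detail.)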

(It might also be worth reimplementing COUNT() to use a "numeric"
counter, so that it doesn't poop out at 2 billion rows. Anyone have
a strong feeling pro or con about that? It could possibly cause problems
for queries that are expecting COUNT() to deliver an int4.)
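
(For the record, the 2-billion figure is just the int4 maximum:
2^31 - 1 = 2147483647.)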

These changes were not practical before 7.0 because aggregates with
pass-by-reference transition data types leaked memory. The leak problem
is gone, so we can change to wider accumulator datatypes without fear of
that. The code will probably run a little slower than before, but that
seems an acceptable price for solving overflow problems.

Comments, objections, better ideas?

regards, tom lane
