Re: [HACKERS] Implementing STDDEV and VARIANCE

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeroen van Vianen <jeroen(at)design(dot)nl>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Implementing STDDEV and VARIANCE
Date: 2000-01-23 18:27:01
Message-ID: 6270.948652021@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Jeroen van Vianen <jeroen(at)design(dot)nl> writes:
> I'd like to implement stddev and variance aggregates in Postgres. This is a
> long standing TODO item.

> 1. Add three columns to pg_aggregate for the additional third transition
> function.

> Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:
>> All that you need to implement this is room to keep two running
>> sums instead of one. I haven't looked at pgsql's aggregate functions,
>> but I'd hope that the working state can be a struct not just a
>> single number.

> I saw no other way than adding another transition function and logic, as
> this might break user-defined aggregates (are there any around?).

Yes, there are some, and no you do not need a third transition
function. What you do need is a datatype holding two values that
you can use as the transition datatype, plus appropriate functions
for the transition functions.

The reason there are two transition functions at all is that it allows
some of the standard aggregate functions to be built using arithmetic
functions that exist anyway --- for example, float8 AVG is built from
float8 addition, float8 increment, and float8 divide, with only float8
increment being a function you wouldn't have anyway. However, the
whole thing is really a kluge; nodeAvg.c has all sorts of weird little
hacks that are necessary to make AVG have the right behavior in boundary
conditions such as no-tuples-in. (A blind application of float8 divide
would lead to a divide-by-zero exception in that case.) These hacks
limit the ability of user-defined aggregates to control their behavior
at the boundary conditions. Nor can an aggregate control its response
to NULL data values; that's hardwired into nodeAvg.c as well.

A cleaner solution would have just one transition function and one
transition data value, plus an optional finalization function that takes
only the one data value. For AVG the transition data type would be a
two-field struct and the transition function would update both fields.
This would halve the function-call overhead per tuple. We'd have to
provide specialized transition and finalization functions for AVG and
probably a couple of the other standard aggregates, but that would allow
us to rely on those functions to do the right things at the boundary
conditions; nodeAvg.c could stop foreclosing the choices.

I have been thinking about proposing such a change along with the
function manager rewrite that is now planned for 7.1. That would be
a good time because user-defined aggregates would need to be revisited
anyway. Also, if the transition functions are to determine the behavior
for no-tuples and for NULL data values, they had better be able to pass
and return NULLs cleanly; which depends on the function manager rewrite.

In short, I'd suggest thinking about implementing STDDEV with a
single transition function and transition data value. You'll need
specialized functions for it anyway, so I don't see that you're saving
any work by proposing a third transition function. What you will need
instead is a pg_type entry for the transition data type, but since that
data type needn't have any operators, there's not much work needed.

> 3. I'm planning to implement this for types float4, float8 and numeric. Any
> other types also? int[2,4,8] don't seem logical, as these would introduce
> serious rounding errors.

I'd suggest just two basic implementations, with float8 and numeric
internal calculations respectively. Data of other numeric types can
be type-coerced to one of these (that might even happen automatically),
but the output would always be either float8 or numeric. (I don't think
float4 has enough precision to generate reliable stddev numbers except
in very narrow conditions...)

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2000-01-23 18:46:29 Re: [HACKERS] column aliases
Previous Message Tom Lane 2000-01-23 17:55:36 Re: pg_dump possible fix, need testers. (was: Re: [HACKERS] pg_dump disaster)