Re: [HACKERS] Implementing STDDEV and VARIANCE

From: Jeroen van Vianen <jeroen(at)design(dot)nl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Implementing STDDEV and VARIANCE
Date: 2000-01-23 20:41:35
Message-ID: 388B677F.381B10E4@design.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tom Lane wrote:
> Jeroen van Vianen <jeroen(at)design(dot)nl> writes:
> > I'd like to implement stddev and variance aggregates in Postgres. This is a
> > long standing TODO item.
>
> > 1. Add three columns to pg_aggregate for the additional third transition
> > function.
>
> > Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:
> >> All that you need to implement this is room to keep two running
> >> sums instead of one. I haven't looked at pgsql's aggregate functions,
> >> but I'd hope that the working state can be a struct not just a
> >> single number.
>
> > I saw no other way than adding another transition function and logic, as
> > this might break user-defined aggregates (are there any around?).
>
> Yes, there are some, and no you do not need a third transition
> function. What you do need is a datatype holding two values that
> you can use as the transition datatype, plus appropriate functions
> for the transition functions.

So it might be better to have this type hold three values (n, sum(x) and
sum(x^2)) and only use one transition function to update all three
values at once and have the finalization function do the necessary
calculations.

> The reason there are two transition functions at all is that it allows
> some of the standard aggregate functions to be built using arithmetic
> functions that exist anyway --- for example, float8 AVG is built from
> float8 addition, float8 increment, and float8 divide, with only float8
> increment being a function you wouldn't have anyway. However, the
> whole thing is really a kluge; nodeAvg.c has all sorts of weird little
> hacks that are necessary to make AVG have the right behavior in boundary
> conditions such as no-tuples-in. (A blind application of float8 divide
> would lead to a divide-by-zero exception in that case.) These hacks
> limit the ability of user-defined aggregates to control their behavior
> at the boundary conditions. Nor can an aggregate control its response
> to NULL data values; that's hardwired into nodeAvg.c as well.

Yes, I saw these little hacks. And there are boundary conditions with
stddev and variance with no rows and one row (for sample stddev and
sample variance).

> A cleaner solution would have just one transition function and one
> transition data value, plus an optional finalization function that takes
> only the one data value. For AVG the transition data type would be a
> two-field struct and the transition function would update both fields.
> This would halve the function-call overhead per tuple. We'd have to
> provide specialized transition and finalization functions for AVG and
> probably a couple of the other standard aggregates, but that would allow
> us to rely on those functions to do the right things at the boundary
> conditions; nodeAvg.c could stop foreclosing the choices.

So you suggest changing all transition functions for 7.1 to keep all the
state they need?

> I have been thinking about proposing such a change along with the
> function manager rewrite that is now planned for 7.1. That would be
> a good time because user-defined aggregates would need to be revisited
> anyway. Also, if the transition functions are to determine the behavior
> for no-tuples and for NULL data values, they had better be able to pass
> and return NULLs cleanly; which depends on the function manager rewrite.

Are you suggesting also to change the lay-out of pg_attribute in 7.1 to
something like this and do updates for all built-in types and
aggregates?

aggname
aggowner
aggtype
aggtranstype [ n, sx, sx2 ]
agginitfunction function that does ( n = 0, sx = 0.0, sx2 = 0.0 )
aggtransfunction function that does ( n = n + 1, sx = sx + x,
sx2 = sx2 + x * x )
aggfinalizefunction function that returns (sx2 - (1/n) * sx * sx ) /
n

Might it be better for me to wait for 7.1 before implementing stddev and
variance?

> In short, I'd suggest thinking about implementing STDDEV with a
> single transition function and transition data value. You'll need
> specialized functions for it anyway, so I don't see that you're saving
> any work by proposing a third transition function. What you will need
> instead is a pg_type entry for the transition data type, but since that
> data type needn't have any operators, there's not much work needed.

OK, clear.

> > 3. I'm planning to implement this for types float4, float8 and numeric. Any
> > other types also? int[2,4,8] don't seem logical, as these would introduce
> > serious rounding errors.
>
> I'd suggest just two basic implementations, with float8 and numeric
> internal calculations respectively. Data of other numeric types can
> be type-coerced to one of these (that might even happen automatically),
> but the output would always be either float8 or numeric. (I don't think
> float4 has enough precision to generate reliable stddev numbers except
> in very narrow conditions...)

OK, only float8 and numeric.

Jeroen

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message The Hermit Hacker 2000-01-23 20:43:22 Re: [HACKERS] Happy column dropping
Previous Message The Hermit Hacker 2000-01-23 20:41:03 Re: [HACKERS] Happy column dropping