Re: [PATCH] Negative Transition Aggregate Functions (WIP)

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, Kevin Grittner <kgrittn(at)ymail(dot)com>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Negative Transition Aggregate Functions (WIP)
Date: 2014-01-14 08:09:06
Message-ID: CAApHDvpximY_AEdmDx2XOx2bgf3Qx6Vd4pV728vbSnxqWecHrA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 14, 2014 at 2:29 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Florian Pflug <fgp(at)phlo(dot)org> writes:
> > I think it'd be worthwile to get this into 9.4, if that's still an
> option,
> > even if we only support COUNT.
>
> My thought is
>
> (1) we can certainly implement inverse transitions for COUNT() and the
> integer variants of SUM(), and that alone would be a killer feature for
> some people.
>
>
100% Agreed about sum(int) and count, but I've so far managed only to find
problems with SUM(numeric), other inverse transition functions seem just
fine with numeric types. e.g. AVG(numeric) and STDDEV(numeric), I can't
produce any results that differ from the unpatched head... but perhaps
someone can give me a case where things could vary?

I've been testing with something like:
select avg(n) over w,SUM(3::float) over w
from (values(1,10::numeric),(2,0),(3,0)) v(i,n)
window w as (order by i rows between current row and unbounded following);

where I included the sum(float) to force the old school brute force
transitions to be performed. I then compared the results to the ones given
without the sum(float) tagged on the end. I've so far not found anything
out of place.

(2) the float and numeric variants should be implemented under nondefault
> names (I'm thinking FAST_SUM(), but bikeshed away). People who need
> extra speed and don't mind the slightly different results can alter
> their queries to use these variants.
>

I'm not as keen on this idea, as if someone later thought of a nice way to
allow the inverse functions to provide exactly the same result as if they
were not used then the would become redundant legacy that would likely have
to be supported forever and a day, and I know we already have things like
that, but of course we want to minimise that as much as possible.

If AVG(numeric) and the stddev numeric aggregates turn out to be valid
then numeric_avg_accum_inv() must remain in core code to support that and
this is the exact function that is required for a user to define their own
SUM(numeric) with. We could leave it at that and document what a user must
do to create their own FAST_SUM then perhaps leave the including these in
core decision to later based maybe on the number of complaints about
SUM(numeric) being slow when used in windows with a moving frame head.

(For me) It does not seem too unreasonable to think that one day we might
decide that:

SELECT AVG(n) FROM (VALUES(10::NUMERIC(10,3)),(0),(0)) v(n);

return 3.333 instead of 3.3333333333333333 like it does now.

If we ever did that then we could support these numeric inverse transitions
on SUM() without having to worry about weird extra trailing zeros. Instead
we'd just give the user what they asked for, when they asked for it.
Reasons like that make me think that we might be jumping the gun a little
on FAST_SUM() as it could end up redundant more quickly than we might
initially think.

> One reason I'm thinking this is that whatever we do to ameliorate
> the semantic issues is going to slow down the forward transition
> function --- to no benefit unless the aggregate is being used as a
> window function in a moving window. So I'm less than convinced
> that we *should* implement any of these designs in the default
> aggregates, even if we get to the point where we arguably *could*
> do it with little risk of functional differences.
>
>
So far there's only 1 case of possible slowdown of forward transitions and
that's with numeric types.
I had to change do_numeric_accum() to add a NaN Counter increment and also
change the logic so that it continues summing non NaN values even after the
first NaN was encountered.

Would you see this as too big a performance risk to include in core?

If not then I think we might be able to support AVG(numeric) and
STDDEV(numeric) functions as the patch sits without having to worry that
there might be an extra overhead on forward transitions that include NaN
values.

Florian has done some really good work researching the standards around
this area. His research seems to indicate that the results should be of the
same scale as the inputs, which the current patch does not do, providing
you assume that the user is showing us the intended scale based on the
numbers that we're working with rather than a scale specified by the column
or cast. Florian's idea of how to fix the scale on inverse transition seems
pretty valid to me and I think the overhead of it could be made pretty
minimal. I'm just not sure that I can implement it in such a way as to make
the overhead small enough to look like background noise. But I'm very
willing to give it a try and see...

*** some moments pass ***

I think unless anyone has some objections I'm going to remove the inverse
transition for SUM(numeric) and modify the documents to tell the user how
to build their own FAST_SUM(numeric) using the built in functions to do it.
I'm starting to think that playing around with resetting numeric scale will
turn a possible 9.4 patch into a 9.5/10.0 patch. I see no reason why what's
there so far, minus sum(numeric), can't go in...

If so then there's 36 aggregate functions ticked off my "create an inverse
transition function for" list! I personally think that's a pretty good win.
I'd rather do this than battle and miss deadlines for 9.4. I'd find that
pretty annoying.

Regards

David Rowley

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tatsuo Ishii 2014-01-14 08:12:10 Re: Proposal: variant of regclass
Previous Message Hannu Krosing 2014-01-14 08:08:40 Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance