Re: Combining Aggregates

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Amit Kapila <amit(dot)kapila(at)enterprisedb(dot)com>
Subject: Re: Combining Aggregates
Date: 2016-03-30 03:14:38
Message-ID: CAKJS1f8QRDLvewk336SzUzxiXH1wBHG8rdKsqWEHbAraMXA2_Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 26 March 2016 at 15:07, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> wrote:

Many thanks Robert for committing the serialize states portion.

> 0005:
> Haribabu's patch; no change from last time.

Just in case you jump ahead. I just wanted to re-highlight something
Haribabu mentioned a while ago about combining floating point states
[1]. This discussion did not happen on this thread, so to make sure it
does not get lost...

As of today aggregate calculations for floating point types can vary
depending on the order in which values are aggregated.

For example:

create table f8 (num float8);
insert into f8 select x/100.0 from generate_series(1,10000) x(x);
select stddev(num order by num) from f8;
stddev
------------------
28.8689567990717
(1 row)

select stddev(num order by num desc) from f8;
stddev
------------------
28.8689567990716
(1 row)

select stddev(num order by random()) from f8;
stddev
------------------
28.8689567990715
(1 row)

And of course the execution plan can determine the order in which rows
are aggregated, even if the underlying data does not change.

Parallelising these aggregates increases the chances of seeing these
variations as the number of rows aggregated in each worker is going to
vary on each run, so the numerical anomalies will also vary between
each run.

I wrote in [1]:
> We do also warn about this in the manual: "Inexact means that some
> values cannot be converted exactly to the internal format and are
> stored as approximations, so that storing and retrieving a value might
> show slight discrepancies. Managing these errors and how they
> propagate through calculations is the subject of an entire branch of
> mathematics and computer science and will not be discussed here,
> except for the following points:" [1]
> [1] http://www.postgresql.org/docs/devel/static/datatype-numeric.html

Does this text in the documents stretch as far as the variable results
from parallel aggregate for floating point types? or should we be more
careful and not parallelise these, similar to how we didn't end up
with inverse aggregate transition functions for floating point types?

I'm personally undecided, and would like to hear what others think.

[1] http://www.postgresql.org/message-id/CAKJS1f_HpLFhkd2yLfrsrmUMBZWQkGvJCWX21B_xg1A-0pzwHw@mail.gmail.com
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro HORIGUCHI 2016-03-30 03:43:36 Re: IF (NOT) EXISTS in psql-completion
Previous Message Tom Lane 2016-03-30 03:13:03 Re: Please correct/improve wiki page about abbreviated keys bug