From: | Erich Schubert <erich(at)debian(dot)org> |
---|---|

To: | Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org |

Subject: | Re: BUG #15307: Low numerical precision of (Co-) Variance |

Date: | 2018-08-07 11:58:11 |

Message-ID: | c95877fb-4441-5c2a-7398-1e5305c32656@vitavonni.de |

Views: | Raw Message | Whole Thread | Download mbox | Resend email |

Thread: | |

Lists: | pgsql-bugs pgsql-hackers |

Hi,

> For a number of those statistical aggregates, PostgreSQL provides 2

> implementations -- one implemented using double precision floating

> point arithmetic, which is much faster, but necessarily less accurate

> and possibly platform-dependent; and one implemented using the

> arbitrary precision numeric datatype, which will return much more

> accurate results. For any input datatypes other than floating point,

> you will automatically get the latter, which is what you're seeing

> with var_samp(x), when you're not explicitly casting the input to

> float8.

Ok, I am surprised that this is possible to do with arbitrary precision

here, as for example with 8 data points, it will eventually involve a

division by 7, which cannot be represented even with arbitrary (finite)

precision floating point. But maybe just this division comes late enough

(after the difference) that it just works before being converted to a

float for the division.

> However, all-the 2-argument aggregates such as corr() and

> covar_pop/samp() currently only have floating point implementations,

> and suffer from the problem you report, which I agree, is not great.

> If we can easily improve the accuracy of those aggregates, then I

> think it is worthwhile.

>

> Using a two pass approach isn't really an option, given the way that

> aggregates work in PostgreSQL, however, implementing Welford's

> algorithm looks to be quite straightforward. I had a quick play and I

> found that it fixed the accuracy problem with no noticeable

> performance penalty -- there are a few extra cycles in the accumulator

> functions, but compared to the overall per-tuple overhead, that

> appears to be negligible.

Instead of Welford, I recommend to use the Youngs-Cramer approach. It is

almost the same; roughly it is aggregating sum-X and sum((X-meanX)²)

directly (whereas Welford updates mean-X, and sum((X-meanX)^2)). So the

first aggregate is actually unchanged to the current approach.

In my experiments, this was surprisingly much faster when the data was a

double[]; explainable by CPU pipelining (see the Figure in the paper I

linked - the division comes late, and the next step does not depend on

the division, so pipelining can already process the next double without

waiting for the division to finish).

Youngs & Cramer original publication:

https://www.jstor.org/stable/pdf/1267176.pdf

The (non-parallel) implementation of the classic algorithms used in the

SSDBM 2018 paper are (templated only for aggregate precision; no guard

against len=0):

template <typename T = double>

double youngs_cramer_var(const double* const data, const size_t len) {

T sum = data[0], var = 0;

for(size_t i=1; i<len;) {

const size_t oldi = i;

const double x = data[i++];

sum += x;

double tmp = i * x - sum;

var += tmp * tmp / (i * (double) oldi);

}

return var / len;

}

Close to Welfords original article:

template <typename T = double>

double welford_var(const double* const data, const size_t len) {

T mean = data[0], var = 0;

for(size_t i=1; i<len; ) {

const size_t oldi = i;

const double x = data[i++];

const double delta = x - mean;

mean = mean * oldi / (double) i + x / i;

var += oldi / (double) i * delta * delta;

}

return var / len;

}

From Knuth's book, attributed to Welford (but usuing fewer divisons),

likely the most widely known incremental version:

template <typename T = double>

double knuth_var(const double* const data, const size_t len) {

T mean = data[0], xx = 0;

for(size_t i=1; i<len;) {

const double v = data[i++];

const double delta = v - mean;

mean += delta / i;

xx += delta * (v - mean);

}

return xx / len;

}

In both Welford and Knuth, the "mean" aggregate depends on the division;

with YC it does not, so I recommend YC.

If Postgres could use parallel aggregation, let me know. I can assist

with the proper aggregation of several partitions (both distributed, or

multithreaded). Using multiple partitions can also slightly improve

precision; but it is mostly interesting to process multiple chunks in

parallel.

If I have some spare time to clean it up some more, I intend to

open-source the entire experimental code from SSDBM18 for reproducibility.

Regards,

Erich

- Re: BUG #15307: Low numerical precision of (Co-) Variance at 2018-08-07 06:52:42 from Dean Rasheed

- Re: BUG #15307: Low numerical precision of (Co-) Variance at 2018-08-09 11:02:06 from Dean Rasheed

From | Date | Subject | |
---|---|---|---|

Next Message | PG Bug reporting form | 2018-08-07 13:35:59 | BUG #15313: Waiting `startup` process on a streaming replica |

Previous Message | Michael Paquier | 2018-08-07 11:29:10 | Re: BUG #15310: pg_upgrade dissociates event triggers from extensions |

From | Date | Subject | |
---|---|---|---|

Next Message | Jesper Pedersen | 2018-08-07 12:14:08 | Re: partition tree inspection functions |

Previous Message | Michael Paquier | 2018-08-07 11:40:43 | Re: Negotiating the SCRAM channel binding type |