Re: BUG #15307: Low numerical precision of (Co-) Variance

From: Erich Schubert <erich(at)debian(dot)org>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15307: Low numerical precision of (Co-) Variance
Date: 2018-08-07 11:58:11
Message-ID: c95877fb-4441-5c2a-7398-1e5305c32656@vitavonni.de
Lists: pgsql-bugs pgsql-hackers

Hi,

> For a number of those statistical aggregates, PostgreSQL provides 2
> implementations -- one implemented using double precision floating
> point arithmetic, which is much faster, but necessarily less accurate
> and possibly platform-dependent; and one implemented using the
> arbitrary precision numeric datatype, which will return much more
> accurate results. For any input datatypes other than floating point,
> you will automatically get the latter, which is what you're seeing
> with var_samp(x), when you're not explicitly casting the input to
> float8.
Ok, I am surprised that this is possible with arbitrary precision here:
with 8 data points, for example, it will eventually involve a division
by 7, whose result cannot be represented exactly even with arbitrary
(but finite) precision. But presumably that division simply comes late
enough (after the differences have been accumulated) that the
computation stays exact until the final conversion and division.
> However, all the 2-argument aggregates such as corr() and
> covar_pop/samp() currently only have floating point implementations,
> and suffer from the problem you report, which I agree, is not great.
> If we can easily improve the accuracy of those aggregates, then I
> think it is worthwhile.
>
> Using a two pass approach isn't really an option, given the way that
> aggregates work in PostgreSQL, however, implementing Welford's
> algorithm looks to be quite straightforward. I had a quick play and I
> found that it fixed the accuracy problem with no noticeable
> performance penalty -- there are a few extra cycles in the accumulator
> functions, but compared to the overall per-tuple overhead, that
> appears to be negligible.
Instead of Welford, I recommend using the Youngs-Cramer approach. It is
almost the same; roughly, it aggregates sum(X) and sum((X-meanX)²)
directly (whereas Welford updates mean(X) and sum((X-meanX)²)). So the
first aggregate is actually unchanged from the current approach.
In my experiments, this was, surprisingly, much faster when the data was
a double[]; this is explainable by CPU pipelining (see the figure in the
paper I linked: the division comes late, and the next step does not
depend on the division, so the pipeline can already process the next
double without waiting for the division to finish).

Youngs & Cramer original publication:
https://www.jstor.org/stable/pdf/1267176.pdf

The (non-parallel) implementations of the classic algorithms used in the
SSDBM 2018 paper are given below (templated only for the aggregate
precision; no guard against len=0):

template <typename T = double>
double youngs_cramer_var(const double* const data, const size_t len) {
    T sum = data[0], var = 0;
    for(size_t i=1; i<len;) {
        const size_t oldi = i;       // old count, n-1
        const double x = data[i++];  // i is now n, the new count
        sum += x;
        const double tmp = i * x - sum;
        var += tmp * tmp / (i * (double) oldi); // += (n*x - sum)^2 / (n*(n-1))
    }
    return var / len; // population variance
}

Close to Welford's original article:

template <typename T = double>
double welford_var(const double* const data, const size_t len) {
    T mean = data[0], var = 0;
    for(size_t i=1; i<len; ) {
        const size_t oldi = i;       // old count, n-1
        const double x = data[i++];  // i is now n, the new count
        const double delta = x - mean;
        mean = mean * oldi / (double) i + x / i;  // weighted mean update
        var += oldi / (double) i * delta * delta; // += (n-1)/n * delta^2
    }
    return var / len;
}

From Knuth's book, attributed to Welford (but using fewer divisions),
likely the most widely known incremental version:

template <typename T = double>
double knuth_var(const double* const data, const size_t len) {
    T mean = data[0], xx = 0;
    for(size_t i=1; i<len;) {
        const double v = data[i++];  // i is now n, the new count
        const double delta = v - mean;
        mean += delta / i;           // mean update depends on the division
        xx += delta * (v - mean);    // uses both the old and the new mean
    }
    return xx / len;
}

In both Welford and Knuth, the "mean" aggregate depends on the division;
with YC it does not, so I recommend YC.

If Postgres could use parallel aggregation, let me know. I can assist
with the proper aggregation of several partitions (both distributed and
multithreaded). Using multiple partitions can also slightly improve
precision, but the main benefit is being able to process multiple chunks
in parallel.
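For reference, merging two Youngs-Cramer partial states can be done with
the combine rule of Chan, Golub & LeVeque; a sketch (the struct and
naming are my own illustration, not from the paper):

```cpp
#include <cstddef>

// Partial Youngs-Cramer state for one partition:
// n points, their sum, and the sum of squared deviations from the mean.
struct VarState { size_t n; double sum; double var; };

// Combine two partial states (Chan/Golub/LeVeque rule); the correction
// term accounts for the difference between the two partition means.
VarState merge(const VarState& a, const VarState& b) {
    if (a.n == 0) return b;
    if (b.n == 0) return a;
    VarState r;
    r.n = a.n + b.n;
    r.sum = a.sum + b.sum;
    const double tmp = a.sum * (double) b.n - b.sum * (double) a.n;
    r.var = a.var + b.var + tmp * tmp / ((double) a.n * b.n * r.n);
    return r;
}
```

E.g. for the partitions {1,2} and {3,4}, the states (2, 3, 0.5) and
(2, 7, 0.5) merge to (4, 10, 5), matching sum((x-2.5)²) over {1,2,3,4}.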
If I find some spare time to clean it up some more, I intend to
open-source the entire experimental code from SSDBM18 for reproducibility.

Regards,
Erich
