BUG #15307: Low numerical precision of (Co-) Variance

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: erich(at)debian(dot)org
Subject: BUG #15307: Low numerical precision of (Co-) Variance
Date: 2018-08-01 13:35:13
Message-ID: 153313051300.1397.9594490737341194671@wrigleys.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

The following bug has been logged on the website:

Bug reference: 15307
Logged by: Erich Schubert
Email address: erich(at)debian(dot)org
PostgreSQL version: 9.6.6
Operating system: Linux
Description:

Numerical precision of variance computations in PostgreSQL is too low.

SELECT VAR_SAMP(x::float8), COVAR_SAMP(x, x), VAR_SAMP(x)
FROM (SELECT 1000000.01 as x UNION SELECT 999999.99 as x) AS x

The first two give the low-precision answer 0.000244140625 instead of
0.0002. Interestingly enough, VAR_SAMP(x) is okay - I guess that postgres
may autodetect a fixed-precision decimal here, or use some high precision
code.

If you add just another digit (10 million +- 1 cent), the bad functions even
return a variance of 0:

SELECT VAR_SAMP(x::float8), COVAR_SAMP(x, x), VAR_SAMP(x)
FROM (SELECT 10000000.01 as x UNION SELECT 9999999.99 as x) AS x

The reason is catastrophic cancellation. Apparently, the covariance is
computed using the E[XY]-E[X]E[Y] approach, which suffers from low numerical
precision.

While I report this against version 9.6.6 (since I used sqlfiddle), this
clearly is present still in Git:

https://github.com/postgres/postgres/blob/6bf0bc842bd75877e31727eb559c6a69e237f831/src/backend/utils/adt/float.c#L2606
https://github.com/postgres/postgres/blob/6bf0bc842bd75877e31727eb559c6a69e237f831/src/backend/utils/adt/float.c#L2969

it should also affect the general "numeric" version (i.e. all versions of
variance/covariance/standard deviation use the known-bad equation), but for
integers it usually will be fine as long as the sum-of-squares can be
stored:
https://github.com/postgres/postgres/blob/6bf0bc842bd75877e31727eb559c6a69e237f831/src/backend/utils/adt/numeric.c#L4883

There are a number of methods to get a reasonable precision. The simplest to
implement is a two pass approach: first compute the mean of each variable,
then compute E[(X-meanX)*(Y-meanY)] in a second pass. This will usually give
the best precision. Faster (single-pass) methods can be found in literature
from the 1970s, and also Donald Knuth's "The Art of Computer Programming".
In particular Young and Cramer's version performed quite well (and
surprisingly much better than Welford's; supposedly due to CPU pipelining).
Single-pass and parallelization-friendly approaches (interesting to use in
particular in distributed databases, but also to avoid IO cost) with good
accuracy are studied in:
> Erich Schubert, and Michael Gertz. Numerically Stable Parallel Computation
of (Co-)Variance. In: Proceedings of the 30th International Conference on
Scientific and Statistical Database Management (SSDBM), Bolzano-Bozen,
Italy. 2018, 10:1–10:12. DOI: 10.1145/3221269.3223036.
https://dl.acm.org/citation.cfm?doid=3221269.3223036

The problem has also been observed in other SQL databases (MS - others like
MySQL have implemented a numeric stable single-pass approach), see e.g.:
> Kamat, Niranjan, and Arnab Nandi. "A Closer Look at Variance
Implementations In Modern Database Systems." ACM SIGMOD Record 45.4 (2017):
28-33. DOI: 10.1145/3092931.3092936.
https://dl.acm.org/citation.cfm?id=3092936

Sorry, I do not know the PostgreSQL internals (aggregation...) well enough
to provide a patch.

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Amit Langote 2018-08-02 10:38:16 Re: Fwd: Problem with a "complex" upsert
Previous Message Tom Lane 2018-08-01 13:29:48 Re: BUG #15306: Use pkg-config for searching libxml2 header

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Verite 2018-08-01 13:40:14 Re: [HACKERS] Can ICU be used for a database's default sort order?
Previous Message Tomas Vondra 2018-08-01 13:33:47 Re: New Defects reported by Coverity Scan for PostgreSQL