Quick Links

Re: BUG #15307: Low numerical precision of (Co-) Variance

From:	Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To:	erich(at)debian(dot)org, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject:	Re: BUG #15307: Low numerical precision of (Co-) Variance
Date:	2018-08-07 06:52:42
Message-ID:	CAEZATCXhKrKbrGX8iPPzXZwPq8xEqkJgv9_Ga8ynJo3A=E19DQ@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-bugs pgsql-hackers

On 1 August 2018 at 14:35, PG Bug reporting form <noreply(at)postgresql(dot)org> wrote:
> Numerical precision of variance computations in PostgreSQL is too low.
>
> SELECT VAR_SAMP(x::float8), COVAR_SAMP(x, x), VAR_SAMP(x)
> FROM (SELECT 1000000.01 as x UNION SELECT 999999.99 as x) AS x
>
> The first two give the low-precision answer 0.000244140625 instead of
> 0.0002. Interestingly enough, VAR_SAMP(x) is okay ...

For a number of those statistical aggregates, PostgreSQL provides 2
implementations -- one implemented using double precision floating
point arithmetic, which is much faster, but necessarily less accurate
and possibly platform-dependent; and one implemented using the
arbitrary precision numeric datatype, which will return much more
accurate results. For any input datatypes other than floating point,
you will automatically get the latter, which is what you're seeing
with var_samp(x), when you're not explicitly casting the input to
float8.

However, all-the 2-argument aggregates such as corr() and
covar_pop/samp() currently only have floating point implementations,
and suffer from the problem you report, which I agree, is not great.
If we can easily improve the accuracy of those aggregates, then I
think it is worthwhile.

Using a two pass approach isn't really an option, given the way that
aggregates work in PostgreSQL, however, implementing Welford's
algorithm looks to be quite straightforward. I had a quick play and I
found that it fixed the accuracy problem with no noticeable
performance penalty -- there are a few extra cycles in the accumulator
functions, but compared to the overall per-tuple overhead, that
appears to be negligible.

I'll post something shortly.

Regards,
Dean

In response to

BUG #15307: Low numerical precision of (Co-) Variance at 2018-08-01 13:35:13 from PG Bug reporting form

Responses

Re: BUG #15307: Low numerical precision of (Co-) Variance at 2018-08-07 11:58:11 from Erich Schubert

Browse pgsql-bugs by date

	From	Date	Subject
Next Message	Haribabu Kommi	2018-08-07 09:40:24	Re: BUG #15310: pg_upgrade dissociates event triggers from extensions
Previous Message	Dean Rasheed	2018-08-07 06:31:26	Re: Fwd: Problem with a "complex" upsert

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Arcadiy Ivanov	2018-08-07 07:14:56	Re: Optimizer misses big in 10.4 with BRIN index
Previous Message	Amit Langote	2018-08-07 06:21:31	Re: BUG #15212: Default values in partition tables don't work as expected and allow NOT NULL violation