Re: BUG #19340: Wrong result from CORR() function

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Oleg Ivanov <o15611(at)gmail(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #19340: Wrong result from CORR() function
Date: 2025-12-03 16:33:36
Message-ID: 648885.1764779616@sss.pgh.pa.us
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> writes:
> On Wed, 3 Dec 2025 at 01:27, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Poking further at this, I found that my v2 patch fails that principle
>> in one case:
>>
>> regression=# SELECT corr( 0.1 , 'nan' ) FROM generate_series(1,1000) g;
>> corr
>> ------
>>
>> (1 row)
>>
>> We see that Y is constant and therefore return NULL, despite the
>> other NaN input.

> I think that would be more readable as 2 separate "if" statements,

Actually, in the light of morning I think this behavior is probably
fine. This example treads on two different edge-cases, one where
we're supposed to return NULL and one where we're supposed to return
NaN, and it's not that open-and-shut which rule should win. A
counterexample here is float8_covar_samp: that returns NULL if there
are less than two input values, and will still do so if there is one
input that is NaN. Should we change that? I don't think so.
The fact that HEAD returns NULL for this corr() case as long as
there is not enough input to create a roundoff problem also suggests
to me that that's the behavior to stick with.

So I'm now thinking that the NULL-output rule should win in cases
where they are both applicable. However, I don't know if we want
to take that so far as to say that constant-NaN input should
produce a NULL; the argument that NaN isn't really a constant
still has some force in my mind. What do you think?

> Another case to consider is this:

> SELECT corr(1.3 + x*1e-16, 1.3 + x*1e-16) FROM generate_series(1, 3) x;

> Here Sxx and Syy end up being zero, so the current code returns NULL,
> but with the patch it returns NaN (because Sxy is also zero). It's not
> quite obvious what to do in this case (the correct answer is 1, but
> there' no way of reliably computing that with double precision
> arithmetic). My first thought is that we should probably try to limit
> NaN results to NaN inputs, and so it would be better to continue to
> return NULL for cases like this where either Sxx or Syy are zero, even
> though the inputs weren't quite constant.

Good example. I had wondered whether to retain the final test for zero
Sxx/Syy, and this shows that we do need that (along with a comment
explaining that roundoff error could produce zeroes even though we
know the inputs weren't constant).

> Then there's this:

> SELECT corr(1e-100 + x*1e-105, 1e-100 + x*1e-105) FROM generate_series(1, 3) x;

> Here Sxx, Syy, and Sxy are all non-zero (roughly 2e-210), but the
> product Sxx * Syy underflows to zero. So in both HEAD and with the
> patch, this returns Infinity. That's not good, given that the
> correlation coefficient is supposed to lie in the range [-1,1].

> The correct answer in this case is also 1, which could be achieved by
> taking the square roots of Sxx and Syy separately, before multiplying,
> but it might also be sensible to clamp the result to the range [-1,1].

I like the idea of taking the square roots separately. I believe
sqrt() is a hardware operation on pretty much any machine people still
care about, and we're doing this part only once per aggregation.
So the cost seems fairly minimal.

regards, tom lane

In response to

Browse pgsql-bugs by date

  From Date Subject
Previous Message Tom Lane 2025-12-03 15:12:07 Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation