| From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
|---|---|
| To: | Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> |
| Cc: | Oleg Ivanov <o15611(at)gmail(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, pgsql-bugs(at)lists(dot)postgresql(dot)org |
| Subject: | Re: BUG #19340: Wrong result from CORR() function |
| Date: | 2025-12-03 16:33:36 |
| Message-ID: | 648885.1764779616@sss.pgh.pa.us |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-bugs |
Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> writes:
> On Wed, 3 Dec 2025 at 01:27, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Poking further at this, I found that my v2 patch fails that principle
>> in one case:
>>
>> regression=# SELECT corr( 0.1 , 'nan' ) FROM generate_series(1,1000) g;
>> corr
>> ------
>>
>> (1 row)
>>
>> We see that Y is constant and therefore return NULL, despite the
>> other NaN input.
> I think that would be more readable as 2 separate "if" statements,
Actually, in the light of morning I think this behavior is probably
fine. This example treads on two different edge-cases, one where
we're supposed to return NULL and one where we're supposed to return
NaN, and it's not that open-and-shut which rule should win. A
counterexample here is float8_covar_samp: that returns NULL if there
are less than two input values, and will still do so if there is one
input that is NaN. Should we change that? I don't think so.
The fact that HEAD returns NULL for this corr() case as long as
there is not enough input to create a roundoff problem also suggests
to me that that's the behavior to stick with.
So I'm now thinking that the NULL-output rule should win in cases
where they are both applicable. However, I don't know if we want
to take that so far as to say that constant-NaN input should
produce a NULL; the argument that NaN isn't really a constant
still has some force in my mind. What do you think?
> Another case to consider is this:
> SELECT corr(1.3 + x*1e-16, 1.3 + x*1e-16) FROM generate_series(1, 3) x;
> Here Sxx and Syy end up being zero, so the current code returns NULL,
> but with the patch it returns NaN (because Sxy is also zero). It's not
> quite obvious what to do in this case (the correct answer is 1, but
> there' no way of reliably computing that with double precision
> arithmetic). My first thought is that we should probably try to limit
> NaN results to NaN inputs, and so it would be better to continue to
> return NULL for cases like this where either Sxx or Syy are zero, even
> though the inputs weren't quite constant.
Good example. I had wondered whether to retain the final test for zero
Sxx/Syy, and this shows that we do need that (along with a comment
explaining that roundoff error could produce zeroes even though we
know the inputs weren't constant).
> Then there's this:
> SELECT corr(1e-100 + x*1e-105, 1e-100 + x*1e-105) FROM generate_series(1, 3) x;
> Here Sxx, Syy, and Sxy are all non-zero (roughly 2e-210), but the
> product Sxx * Syy underflows to zero. So in both HEAD and with the
> patch, this returns Infinity. That's not good, given that the
> correlation coefficient is supposed to lie in the range [-1,1].
> The correct answer in this case is also 1, which could be achieved by
> taking the square roots of Sxx and Syy separately, before multiplying,
> but it might also be sensible to clamp the result to the range [-1,1].
I like the idea of taking the square roots separately. I believe
sqrt() is a hardware operation on pretty much any machine people still
care about, and we're doing this part only once per aggregation.
So the cost seems fairly minimal.
regards, tom lane
| From | Date | Subject | |
|---|---|---|---|
| Previous Message | Tom Lane | 2025-12-03 15:12:07 | Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation |