| From: | Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> |
|---|---|
| To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
| Cc: | Oleg Ivanov <o15611(at)gmail(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, pgsql-bugs(at)lists(dot)postgresql(dot)org |
| Subject: | Re: BUG #19340: Wrong result from CORR() function |
| Date: | 2025-12-06 11:51:20 |
| Message-ID: | CAEZATCWCnJkCbsK69aQ-bkLyFitxakaOz=5AmzAvuS1+M507zg@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-bugs |
On Sat, 6 Dec 2025 at 06:09, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> > Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> writes:
> >> Looking at float8_regr_accum(), I think it would be preferable to
> >> arrange for it to leave Sxx, Syy, and Sxy zero until distinct X and Y
> >> values are seen. I.e., something like this:
>
> > That seems like a good idea. I was initially worried that the extra
> > isnan() checks would slow down aggregation noticeably in the normal
> > case where we soon discover that the inputs aren't all equal.
>
> BTW, re-reading the patch, I now think we should drop the initial
>
> if (isnan(commonX) || isnan(commonY))
>
> test, instead bulling ahead with computing tmpX/tmpY/scale, and
> only skip the updates of Sxx/Syy/Sxy when we have constant inputs.
> Using that initial test is optimizing for constant inputs at the
> expense of non-constant inputs, which seems like the wrong way
> to bet.
[shrug] I doubt that it will make any noticeable performance
difference, but it will reduce the size of the diff and the final
code.
I've been thinking some more about the behaviour with NaNs, and I was
initially surprised by the difference between these 2 results:
SELECT corr(0.1, 'NaN') FROM generate_series(1, 30) g;
corr
------
(1 row)
SELECT corr('NaN', 'NaN') FROM generate_series(1, 30) g;
corr
------
NaN
(1 row)
but actually, on reflection, I think it makes sense.
The first result is required by the SQL standard, which says first
that if N * Sxx = Sx * Sx, then the result is NULL, and similarly for
y. So that rule should take precedence over any rule for NaNs.
The second result can be justified by the IEEE rules for NaNs,
according to which NaN does not equal any other number, including
itself, and so an all-NaN column is not all equal (and it doesn't
satisfy "N * Sxx = Sx * Sx" either). So the SQL standard all-the-same
rule doesn't apply to the second query, and the standard computation
yields NaN.
If we're happy with those decisions, then I think this comment should
be updated:
+ * commonX is defined as the common X value if all the X values were the same,
+ * else NaN; likewise for commonY. This is useful for deciding whether corr()
+ * and related functions should return NULL. This representation cannot
+ * distinguish all-the-values-were-NaN from the-values-weren't-all-the-same,
+ * but that's okay because we want to return NaN for all-NaN input.
because that second sentence is only true for all-NaN input in both X
and Y. Perhaps say that all-NaNs in a column is intentionally not
treated as all-the-same in that column, in accordance with IEEE.
Also, we should probably include the second query above in the regression tests.
Regards,
Dean
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Tom Lane | 2025-12-06 18:18:43 | Re: BUG #19340: Wrong result from CORR() function |
| Previous Message | Tom Lane | 2025-12-06 06:09:36 | Re: BUG #19340: Wrong result from CORR() function |