Re: [PATCH] random_normal function

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Paul Ramsey <pramsey(at)cleverelephant(dot)ca>
Cc: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Michael Paquier <michael(at)paquier(dot)xyz>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH] random_normal function
Date: 2023-01-09 00:20:40
Message-ID: 3983784.1673223640@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> So the problem in this patch is that it's trying to include
> utils/float.h in a src/common file, where we have not included
> postgres.h. Question is, why did you do that?

(Ah, for M_PI ... but our practice is just to duplicate that #define
where needed outside the backend.)

I spent some time reviewing this patch. I'm on board with
inventing random_normal(): the definition seems solid and
the use-case for it seems reasonably well established.
I'm not necessarily against inventing similar functions for
other distributions, but this patch is not required to do so.
We can leave that discussion until somebody is motivated to
submit a patch for one.

On the other hand, I'm much less on board with inventing
random_string(): we don't have any clear field demand for it
and the appropriate definitional details are a lot less obvious
(for example, whether it needs to be based on pg_strong_random()
rather than the random() sequence). I think we should leave that
out, and I have done so in the attached updated patch.

I noted several errors in the submitted patch. It was creating
the function as PARALLEL SAFE which is just wrong, and the whole
business with checking PG_NARGS is useless because it will always
be 2. (That's not how default arguments work.)

The business with checking against DBL_EPSILON seems wrong too.
All we need is to ensure that u1 > 0 so that log(u1) will not
choke; per spec, log() is defined for any positive input. I see that
that seems to have been modeled on the C++ code in the Wikipedia
page, but I'm not sure that C++'s epsilon means the same thing, and
if it does then their example code is just wrong. See the discussion
about "tails truncation" immediately above it: artificially
constraining the range of u1 just limits how much of the tail
of the distribution we can reproduce. So that led me to doing
it the same way as in the existing Box-Muller code in pgbench,
which I then deleted per Fabien's advice.

BTW, the pgbench code was using sin() not cos(), which I duplicated
because using cos() causes the expected output of the pgbench tests
to change. I'm not sure whether there was any hard reason to prefer
one or the other, and we can certainly change the expected output
if there's some reason to prefer cos().

I concur with not worrying about the Inf/NaN cases that Mark
pointed out. It's not obvious that the results the proposed code
produces are wrong, and it's even less obvious that anyone will
ever care.

Also, I tried running the new random.sql regression cases over
and over, and found that the "not all duplicates" test fails about
one time in 100000 or so. We could probably tolerate that given
that the random test is marked "ignore" in parallel_schedule, but
I thought it best to add one more iteration so we could knock the
odds down. Also I changed the test iterations so they weren't
all invoking random_normal() in exactly the same way.

This version seems committable to me, barring objections.

regards, tom lane

Attachment Content-Type Size
random_normal_8.patch text/x-diff 10.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2023-01-09 00:27:59 Re: Fixing a couple of buglets in how VACUUM sets visibility map bits
Previous Message Andres Freund 2023-01-08 23:53:09 Re: Fixing a couple of buglets in how VACUUM sets visibility map bits