Re: rand48 replacement

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Aleksander Alekseev <aleksander(at)timescale(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: rand48 replacement
Date: 2021-07-01 21:18:44
Message-ID: alpine.DEB.2.22.394.2107012212500.2248546@pseudo
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hello Dean,

> I haven't looked at the patch in detail, but one thing I object to is
> the code to choose a random integer in an arbitrary range.

Thanks for bringing up this interesting question!

> Currently, this is done in pgbench by getrand(), which has its
> problems.

Yes. That is one of the motivation for providing something hopefully
better.

> However, this patch seems to be replacing that with a simple
> modulo operation, which is perhaps the worst possible way to do it.

I did it knowing this issue. Here is why:

The modulo operation is biased for large ranges close to the limit, sure.
Also, the bias is somehow of the same magnitude as the FP multiplication
approach used previously, so the "worst" has not changed much, it is
really the same as before.

I thought it is not such an issue because for typical uses we are unlikely
to be in these conditions, so the one-operation no branching approach
seemed like a good precision vs performance compromise: I'd expect the
typical largest ranges to be well below 40 bits (eg a key in a pretty
large table in pgbench), which makes the bias well under 1/2**24 and ISTM
that I can live with that. With the initial 48 bits state, obviously the
situation was not the same.

> There's plenty of research out there on how to do it better -- see,
> for example, [1] for a nice summary.

Rejection methods include branches, thus may cost significantly more, as
show by the performance figures in blog.

Also, it somehow breaks the sequence determinism when using range, which I
found quite annoying: ISTM desirable that when generating a number the
state advances once, and just once.

Also some methods have higher costs depending on the actual range, eg the
bitmask approach: for range 129 the bitmask is 0xff and you have a nearly
50% probability of iterating once, nearly 25% of iterating twice, and so
on… I like performance to be uniform, not to depend on actual values.

Given these arguments I'd be inclined to keep the bias, but I'm open to
more discussion.

> Also, I'd say that functions to choose random integers in an arbitrary
> range ought to be part of the common API, as they are in almost every
> language's random API.

That is a good point.

--
Fabien.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2021-07-01 21:40:27 Re: SSL/TLS instead of SSL in docs
Previous Message Andrew Dunstan 2021-07-01 21:08:50 Re: make world and install-world without docs