Re: random() (was Re: New GUC to sample log queries)

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)anayrat(dot)info>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, vik(dot)fearing(at)2ndquadrant(dot)com, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: random() (was Re: New GUC to sample log queries)
Date: 2018-12-27 08:36:32
Message-ID: alpine.DEB.2.21.1812270833190.32444@lancre
Lists: pgsql-hackers


Hello all,

> I am not sure I buy the argument that this is a security hazard, but
> there are other reasons to question the use of random() here, some of
> which you stated yourself above. Another one is that using random()
> for internal purposes interferes with applications' possible use of
> drandom() and setseed(), ie an application depending on getting a
> particular random series would see different behavior depending on
> whether this GUC is active or not.
>
> Another idea, which would be a lot less prone to breakage by
> add-on code, is to change drandom() and setseed() to themselves
> use pg_erand48() with a private seed.

My random thoughts about random, erand48, etc., which may be slightly off
topic; sorry if this is the case.

The word "random" is a misnomer for these pseudo-random generators, so
that "strong" has to be used for higher quality generators:-(

On Linux, random() is advertised with a period of around 2**36, and its
internal state is 8 to 256 bytes (default unclear, probably 8 bytes).
However, seeding with srandom() provides only 32 bits, which is a drawback.

The pg_erand48 code looks like crumbs from the 70's optimized for 16-bit
architectures (which it probably is not, but not going to 64 bits or
128 bits directly looks like a missed opportunity). Its internal state is
48 bits, as its name implies, and its period is probably around 2**48,
which is 2**12 better than the previous case: not an extraordinary
achievement.
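
For reference, the erand48 family is a 48-bit linear congruential
generator, X(n+1) = (a * X(n) + c) mod 2**48, with the POSIX constants
a = 0x5DEECE66D and c = 0xB. A minimal sketch of one step written with
plain 64-bit arithmetic instead of three 16-bit limbs might look like
this; the names lcg48_next and lcg48_double are hypothetical and not
part of PostgreSQL:

```c
#include <stdint.h>

#define LCG48_MULT UINT64_C(0x5DEECE66D)        /* POSIX drand48 multiplier */
#define LCG48_ADD  UINT64_C(0xB)                /* POSIX drand48 increment */
#define LCG48_MASK ((UINT64_C(1) << 48) - 1)    /* keep the low 48 bits */

/*
 * Advance a 48-bit state by one step.  The 64-bit product may wrap, but
 * only the low 48 bits are kept, so the result is still correct mod 2**48.
 */
static uint64_t
lcg48_next(uint64_t state)
{
    return (state * LCG48_MULT + LCG48_ADD) & LCG48_MASK;
}

/* Map a 48-bit state to a double in [0, 1), as erand48 does. */
static double
lcg48_double(uint64_t state)
{
    return (double) (state & LCG48_MASK) / 281474976710656.0;   /* 2**48 */
}
```

The whole state fits in one machine word, which is the point being made
about the 16-bit layout.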

Initial seeding of any pseudo-random generator should NEVER use only pid &
time, which are too predictable, as already noted on the thread. It
should use a strong random source if available, and maybe some backup, e.g.
hashing logs. I think that this should be directly implemented, maybe with
some provision to set the seed manually for debugging purposes, although
with time-dependent features that may use random I'm not sure how far this
would go.
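
A minimal sketch of that seeding order, assuming /dev/urandom stands in
for the strong source and that a manual seed, when given, always wins.
The function name choose_seed and its return codes are hypothetical:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/*
 * Pick a 64-bit seed.
 * Returns 2 if the manual (debugging) seed was used,
 *         1 if the strong source was used,
 *         0 if only the weak pid/time fallback was available.
 */
static int
choose_seed(const uint64_t *manual_seed, uint64_t *seed)
{
    FILE *f;

    /* Debugging provision: an explicit seed overrides everything. */
    if (manual_seed != NULL)
    {
        *seed = *manual_seed;
        return 2;
    }

    /* Assumption: /dev/urandom is the strong source on this platform. */
    f = fopen("/dev/urandom", "rb");
    if (f != NULL)
    {
        size_t n = fread(seed, sizeof(*seed), 1, f);

        fclose(f);
        if (n == 1)
            return 1;
    }

    /* Last resort only: pid and time are predictable. */
    *seed = ((uint64_t) getpid() << 32) ^ (uint64_t) time(NULL);
    return 0;
}
```

Returning which path was taken lets the caller log a warning when the
weak fallback had to be used.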

Also, I would suggest centralizing and abstracting the implementation of a
default pseudo-random generator so that its actual internal size and
quality can be changed. That would mean renaming pg_erand48 and hiding
its state size, maybe along the lines of:

// extractors
void pg_random_bytes(int nbytes, char *where_to_put_them);

uint32 pg_random_32(void);
uint64 pg_random_48(void);
uint64 pg_random_64(void);
...

// dynamic?
int pg_random_state_size(void); // in bytes
// or static?
#define PG_RANDOM_STATE_SIZE 6 // bytes

// get/set state
bool pg_random_get_state(uchar *state(, int size|[PG_RANDOM_STATE_SIZE]));
bool pg_random_set_state(const uchar *state...);

Given the typical hardware a postgres instance runs on, I would shop
around for a pseudo-random generator which takes advantage of 64-bit
operations, and not go below 64-bit seeds, or possibly 128.
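
To illustrate how the interface could hide the state size behind a
64-bit generator, here is a rough sketch using splitmix64 as the
generator. The choice of splitmix64 is an assumption made for the
example, not a recommendation from the thread, and the sketch_ names
are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_RANDOM_STATE_SIZE 8  /* bytes; callers should not hardcode it */

static uint64_t sketch_state = 0;   /* private generator state */

/* One splitmix64 step: 64-bit state in, 64-bit output. */
static uint64_t
sketch_random_64(void)
{
    uint64_t z = (sketch_state += UINT64_C(0x9E3779B97F4A7C15));

    z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
    return z ^ (z >> 31);
}

/* Narrower extractors take the high bits of a 64-bit draw. */
static uint32_t
sketch_random_32(void)
{
    return (uint32_t) (sketch_random_64() >> 32);
}

/* Copy the state out; fails if the caller assumed a different size. */
static bool
sketch_random_get_state(unsigned char *state, int size)
{
    if (size != SKETCH_RANDOM_STATE_SIZE)
        return false;
    memcpy(state, &sketch_state, SKETCH_RANDOM_STATE_SIZE);
    return true;
}

/* Restore a previously saved state, with the same size check. */
static bool
sketch_random_set_state(const unsigned char *state, int size)
{
    if (size != SKETCH_RANDOM_STATE_SIZE)
        return false;
    memcpy(&sketch_state, state, SKETCH_RANDOM_STATE_SIZE);
    return true;
}
```

Because callers only see an opaque byte buffer and a size, the generator
behind it could later be swapped for one with a larger state.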

If a strong random source is available but considered too costly, so that
a (weak) linear congruential algorithm must be used, a possible compromise
is to reseed from the strong source every few thousand or million draws, or
with a time periodicity, e.g. every few minutes, or maybe some configurable
option.
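
A minimal sketch of the draw-count variant, assuming /dev/urandom as the
strong source and a 64-bit LCG with Knuth's MMIX constants as the cheap
generator. The type and function names are hypothetical:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

typedef struct
{
    uint64_t state;             /* LCG state */
    uint64_t draws_left;        /* draws remaining before the next reseed */
    uint64_t reseed_interval;   /* configurable number of draws per seed */
    uint64_t reseed_count;      /* how many times a seed was fetched */
} reseeding_lcg;

/* Assumption: /dev/urandom is the (costly) strong source. */
static uint64_t
strong_seed(void)
{
    uint64_t seed;
    FILE *f = fopen("/dev/urandom", "rb");

    if (f != NULL)
    {
        size_t n = fread(&seed, sizeof(seed), 1, f);

        fclose(f);
        if (n == 1)
            return seed;
    }
    return (uint64_t) time(NULL);   /* weak fallback, last resort */
}

static void
reseeding_lcg_init(reseeding_lcg *g, uint64_t interval)
{
    g->reseed_interval = (interval > 0) ? interval : 1;
    g->state = strong_seed();
    g->draws_left = g->reseed_interval;
    g->reseed_count = 1;
}

static uint64_t
reseeding_lcg_next(reseeding_lcg *g)
{
    /* Pay for the strong source only once per interval. */
    if (g->draws_left == 0)
    {
        g->state = strong_seed();
        g->draws_left = g->reseed_interval;
        g->reseed_count++;
    }
    g->draws_left--;

    /* 64-bit LCG step (Knuth's MMIX constants), arithmetic mod 2**64. */
    g->state = g->state * UINT64_C(6364136223846793005)
        + UINT64_C(1442695040888963407);
    return g->state;
}
```

A time-based variant would compare a stored timestamp against the
current time instead of counting draws.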

A not too costly security enhancer is to combine different fast generators
so that if one becomes weak at some point, the combination does not.
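
As a rough illustration of combining generators, the sketch below XORs
the outputs of a 64-bit LCG and a xorshift64 generator. The choice of
these two components is an assumption made for the example, and all
names are hypothetical:

```c
#include <stdint.h>

typedef struct
{
    uint64_t lcg;   /* state of the linear congruential component */
    uint64_t xs;    /* state of the xorshift component, never zero */
} combined_rng;

static void
combined_init(combined_rng *g, uint64_t seed1, uint64_t seed2)
{
    g->lcg = seed1;
    g->xs = (seed2 != 0) ? seed2 : 1;   /* xorshift is stuck at zero */
}

/* 64-bit LCG step with Knuth's MMIX constants. */
static uint64_t
lcg_step(uint64_t *s)
{
    *s = *s * UINT64_C(6364136223846793005) + UINT64_C(1442695040888963407);
    return *s;
}

/* Marsaglia xorshift64 step with the (13, 7, 17) shift triple. */
static uint64_t
xorshift_step(uint64_t *s)
{
    uint64_t x = *s;

    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *s = x;
    return x;
}

/* Each draw advances both components and XORs their outputs. */
static uint64_t
combined_next(combined_rng *g)
{
    return lcg_step(&g->lcg) ^ xorshift_step(&g->xs);
}
```

The two components have unrelated structure, so a statistical weakness
in one is masked by the other. Whether such a combination is good enough
for a given security purpose would still need to be evaluated
separately.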

--
Fabien.
