PG 10: could not generate random cancel key

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: PG 10: could not generate random cancel key
Date: 2018-07-17 12:33:11
Message-ID: CAEZATCXMtxbzSAvyKKk5uCRf9pNt4UV+F_5v=gLfJUuPxU4Ytg@mail.gmail.com
Lists: pgsql-hackers

Last week I upgraded 15 servers from various pre-10 versions to 10.4.
At first everything looked OK, but then (around 4 days later) one of
them failed with this in the logs:

2018-07-14 01:53:35.840 BST LOG: could not generate random cancel key
2018-07-14 01:53:37.233 BST LOG: could not generate random cancel key
2018-07-14 01:53:37.245 BST LOG: could not generate random cancel key
2018-07-14 01:53:38.553 BST LOG: could not generate random cancel key
2018-07-14 01:53:38.581 BST LOG: could not generate random cancel key
2018-07-14 01:54:43.851 BST WARNING: worker took too long to start; canceled
2018-07-14 01:54:43.862 BST LOG: could not generate random cancel key
2018-07-14 01:55:09.861 BST LOG: could not generate random cancel key
2018-07-14 01:55:09.874 BST LOG: could not generate random cancel key
...

After that it would not accept any new connections until I restarted
postmaster a few hours later. Since then, it has been OK.

It was built with --with-openssl and with strong random support enabled,
so it was OpenSSL's RAND_bytes() that failed for some reason. I
attempted to reproduce it with a small C program directly calling
RAND_bytes(), but it refused to fail, even if I disabled haveged and
ran my tests in an @reboot cron job. So this failure is evidently
quite rare, but the documentation for RAND_bytes() says it *can* fail
(returning 0) if it isn't seeded with enough entropy, in which case
more must be added, which we're not doing.

In addition, once it does fail, repeated calls to RAND_bytes() will
continue to fail if it isn't seeded with more data -- hence the
inability to start any new backends until after a postmaster restart,
which is not a very friendly failure mode.

The OpenSSL documentation suggests that we should use RAND_status()
[1] to check that the generator has been seeded with enough data:

    RAND_status() indicates whether or not the CSPRNG has been sufficiently
    seeded. If not, functions such as RAND_bytes(3) will fail.

and if not, RAND_poll() can be used to fix that:

    RAND_poll() uses the system's capabilities to seed the CSPRNG using
    random input obtained from polling various trusted entropy sources. The
    default choice of the entropy source can be modified at build time using
    the --with-rand-seed configure option, see also the NOTES section. A
    summary of the configure options can be displayed with the OpenSSL
    version(1) command.

Looking for precedents elsewhere, I found [2] which does exactly that,
although I'm slightly dubious about the need for the for-loop there. I
also found a thread [3], which recommends simply doing

    if (RAND_status() == 0)
        RAND_poll();

which seems preferable. Attached is a patch to do this in pg_strong_random().
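
For illustration only, here is a self-contained sketch of that pattern. It
is not the attached patch. The mock_* functions are hypothetical stand-ins
for RAND_status(), RAND_poll() and RAND_bytes(); they model only the return
convention (1 on success, 0 on failure) and the "keeps failing until
reseeded" behaviour described above, so that the example runs without
OpenSSL. The two wrapper names are made up for the example as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-ins for OpenSSL's RAND_status(), RAND_poll() and
 * RAND_bytes(). They are NOT the real implementations; they only mimic
 * the 1 = success / 0 = failure convention.
 */
static int mock_seeded = 0;		/* 0 models an unseeded generator */

static int
mock_rand_status(void)
{
	return mock_seeded;
}

static int
mock_rand_poll(void)
{
	mock_seeded = 1;			/* pretend entropy was gathered */
	return 1;
}

static int
mock_rand_bytes(unsigned char *buf, int num)
{
	assert(buf != NULL && num >= 0);
	if (!mock_seeded)
		return 0;				/* keeps failing until reseeded */
	memset(buf, 0xA5, (size_t) num);	/* placeholder bytes, not random */
	return 1;
}

/* No check: fails on every call for as long as the generator is unseeded. */
static bool
strong_random_unchecked(void *buf, size_t len)
{
	return mock_rand_bytes(buf, (int) len) == 1;
}

/* With the check: reseed first if the generator says it is not seeded. */
static bool
strong_random_checked(void *buf, size_t len)
{
	if (mock_rand_status() == 0)
		(void) mock_rand_poll();
	return mock_rand_bytes(buf, (int) len) == 1;
}
```

With mock_seeded left at 0, repeated calls to strong_random_unchecked()
all return false, which mirrors the repeated log messages above, while the
first call to strong_random_checked() reseeds and returns true. In real
code the mock calls would be the OpenSSL functions from <openssl/rand.h>.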

Thoughts?

Regards,
Dean

[1] https://www.openssl.org/docs/man1.1.1/man3/RAND_status.html
[2] https://github.com/nodejs/node/blob/master/src/node_crypto.cc
[3] https://github.com/openssl/openssl/issues/4148

Attachment: pg_strong_random.patch (text/x-patch, 530 bytes)
