Re: jacana hung after failing to acquire random number

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: jacana hung after failing to acquire random number
Date: 2016-12-12 14:01:53
Message-ID: aa5ad201-fd9e-cddd-8a24-36dee5c175b4@iki.fi
Lists: pgsql-hackers

On 12/12/2016 03:40 PM, Andrew Dunstan wrote:
>
>
> On 12/12/2016 02:32 AM, Heikki Linnakangas wrote:
>> On 12/12/2016 05:58 AM, Michael Paquier wrote:
>>> On Sun, Dec 11, 2016 at 9:06 AM, Andrew Dunstan <andrew(at)dunslane(dot)net>
>>> wrote:
>>>>
>>>> jacana (mingw, 64 bit compiler, no openssl) is currently hung on "make
>>>> check". After starting the autovacuum launcher there are 120 messages
>>>> in its log about "Could not acquire random number". Then nothing.
>>>>
>>>>
>>>> So I suspect the problem here is commit
>>>> fe0a0b5993dfe24e4b3bcf52fa64ff41a444b8f1, although I haven't looked in
>>>> detail.
>>>>
>>>>
>>>> Shouldn't we want the postmaster to shut down if it's not going to go
>>>> further? Note that frogmouth, also mingw, which builds with openssl,
>>>> doesn't
>>>> have this issue.
>>>
>>> Did you unlock it in some way at the end? Here is the shape of the
>>> report for others:
>>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2016-12-10%2022%3A00%3A15
>>>
>>> And here is of course the interesting bit:
>>> 2016-12-10 17:25:38.822 EST [584c80e2.ddc:2] LOG: could not acquire
>>> random number
>>> 2016-12-10 17:25:39.869 EST [584c80e2.ddc:3] LOG: could not acquire
>>> random number
>>> 2016-12-10 17:25:40.916 EST [584c80e2.ddc:4] LOG: could not acquire
>>> random number
>>>
>>> I am not seeing any problems with MSVC without openssl, so that's a
>>> problem specific to MinGW. I am starting to wonder if it is actually a
>>> good idea to cache the crypt context and then re-use it. Using a new
>>> context all the time is definitely not good performance-wise, though.
>>
>> Actually, looking at the config.log on jacana, it's trying to use
>> /dev/urandom:
>>
>> configure:15028: checking for /dev/urandom
>> configure:15041: result: yes
>> configure:15054: checking which random number source to use
>> configure:15073: result: /dev/urandom
>>
>> And looking closer at configure.in, I can see why:
>>
>> elif test "$PORTNAME" = x"win32" ; then
>> USE_WIN32_RANDOM=1
>>
>> That test is broken. It looks like the x"$VAR" = x"constant" idiom,
>> but the left side of the comparison doesn't have the 'x'. Oops.
>>
>> Fixed that, let's see if it made jacana happy again.
>>
>> This makes me wonder if we should work a bit harder to produce a good
>> error message if acquiring a random number fails for any reason. This
>> needs to work in the frontend as well as in the backend, but we could
>> still have an elog(LOG, ...) there, inside an #ifndef FRONTEND block.
>
>
> I see you have now improved the messages in postmaster.c, which is good.

Well, I only wordsmithed them a bit; it still doesn't give much clue as
to why it failed. We should add more detail to it.
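
Something along these lines in the /dev/urandom code path of
pg_strong_random(), perhaps. Just an untested sketch; the Windows and
OpenSSL code paths would need similar treatment, and the variable names
here are made up:

    int     fd;

    fd = open("/dev/urandom", O_RDONLY, 0);
    if (fd < 0)
    {
#ifndef FRONTEND
        /* say *why* acquiring random data failed */
        elog(LOG, "could not open \"/dev/urandom\": %m");
#endif
        return false;
    }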

> However, the bigger problem (ISTM) is that when this failed I had a
> system which was running but where every connection immediately failed:
>
> ============== creating temporary instance ==============
> ============== initializing database system ==============
> ============== starting postmaster ==============
>
> pg_regress: postmaster did not respond within 120 seconds
> Examine c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/regress/log/postmaster.log for the reason
> make: *** [check] Error 2
>
> Should one or more of these errors be fatal? Or should we at least get
> pg_regress to try to shut down the postmaster if it can't connect after
> 120 seconds?

Making it fatal, i.e. bringing down the server, doesn't seem like an
improvement. If the failure is transient, you don't want to kill the
whole server when one connection attempt fails.

It would be nice to fail earlier if it's permanently failing, though.
Like, if someone does "rm /dev/urandom". Perhaps we should perform one
pg_strong_random() call at postmaster startup, and if that fails, refuse
to start up.
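
Something like this early in PostmasterMain(), perhaps (untested
sketch; the exact placement and message wording would need more
thought):

    {
        char    dummy[16];

        /*
         * Sanity check: make sure the random number source works at all,
         * so that a permanently broken source stops the postmaster at
         * startup instead of failing every connection attempt later.
         */
        if (!pg_strong_random(dummy, sizeof(dummy)))
            ereport(FATAL,
                    (errmsg("could not acquire random data from the random number source")));
    }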

- Heikki
