Rare SSL failures on eelpout

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Rare SSL failures on eelpout
Date: 2019-01-22 11:04:36
Message-ID: CAEepm=2n6Nv+5tFfe8YnkUm1fXgvxR0Mm1FoD+QKG-vLNGLyKg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

(Moving this discussion from the pgsql-committers thread "pgsql:
Update ssl test certificates and keys", which is innocent.)

On Wed, Jan 16, 2019 at 10:37 AM Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Fri, Jan 4, 2019 at 10:08 AM Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> > Thanks. FWIW I've just updated eelpout (a Debian testing BF animal
> > that runs all the extra tests including SSL) to use libssl-dev
> > (instead of libssl1.0-dev), and cleared its accache files. Let's see
> > if that works...
>
> Since that upgrade (to libssl 1.1.1a-1), there are have been a few
> intermittent failures in the SSL tests on that animal (thanks Tom for
> pointing that out off-list).

I reproduced this manually. From time to time (less than 1% of the
times I tried), the server hangs up without sending a goodbye message,
thereby causing a failure to match the expected psql error message.
From the kernel's perspective on the psql side, a working-as-expected
case looks like this:

sendto(3, "\27\3\3\2\356\313\334\372\201\353\201h\204\353\240A\4\355\232\340\375\0X\220\326V\374\253\222i\2724"...,
1115, MSG_NOSIGNAL, NULL, 0) = 1115
ppoll([{fd=3, events=POLLOUT|POLLERR}], 1, NULL, NULL, 0) = 1 ([{fd=3,
revents=POLLOUT}])
sendto(3, "\27\3\3\0f\335\313\224\263\256r\371\215\177\273,N\345\342\200\257\r\321\6\323(at)\316\241\316\17\204\26"...,
107, MSG_NOSIGNAL, NULL, 0) = 107
ppoll([{fd=3, events=POLLIN|POLLERR}], 1, NULL, NULL, 0) = 1 ([{fd=3,
revents=POLLIN}])
recvfrom(3, "\27\3\3\0\23", 5, 0, NULL, NULL) = 5
recvfrom(3, "I\"\t;\3006\276\347\344]7>\2\234m\340]B\241", 19, 0,
NULL, NULL) = 19
close(3) = 0
write(2, "psql: SSL error: sslv3 alert cer"..., 49psql: SSL error:
sslv3 alert certificate revoked
) = 49

... and the unexpected failure case looks like this:

sendto(3, "\27\3\3\2\356/\254\307\277\342G?&BB\321\f\314\245\22\246U\337\263;\203f\302s\306\37\276"...,
1115, MSG_NOSIGNAL, NULL, 0) = 1115
ppoll([{fd=3, events=POLLOUT|POLLERR}], 1, NULL, NULL, 0) = 1 ([{fd=3,
revents=POLLOUT|POLLERR|POLLHUP}])
sendto(3, "\27\3\3\0f\177\335\20\305\353\234\306\211#\343\321\336\22111J\312\323F\210\6U\331\264GAr"...,
107, MSG_NOSIGNAL, NULL, 0) = -1 ECONNRESET (Connection reset by peer)
write(2, "psql: server closed the connecti"..., 199psql: server closed
the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
could not send startup packet: Connection reset by peer
) = 199

To the kernel on the server side, a good case looks like this:

[pid 13740] sendto(8,
"\27\3\3\0\23\252\21\227\2\232s\354\21\243\236\207\301\3B\341\253\306k\346",
24, 0, NULL, 0) = 24
[pid 13740] write(2, "2019-01-22 09:23:04.425 UTC [137"...,
1112019-01-22 09:23:04.425 UTC [13740] [unknown] LOG: could not
accept SSL connection: certificate verify failed
) = 111
[pid 13740] exit_group(0) = ?
[pid 13740] +++ exited with 0 +++

I didn't manage to trace a bad case on the server side (maybe the
strace interferes with the required timing), but I noticed that
SSL_shutdown()'s return code (which we don't check) is always -1, and
I noticed that SSL_get_error() after that is always SSL_ERROR_SSL. I
wondered if we might somehow be reaching exit() when the alert
response is still buffered inside OpenSSL, but that doesn't seem right
-- the port has noblock=0 at that stage.

Hmm. Why is psql doing two sendto() calls without reading a response
in between, when it's possible for the server to exit after the first,
anyway? Seems like a protocol violation somewhere?

--
Thomas Munro
http://www.enterprisedb.com

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Gierth 2019-01-22 11:05:58 Re: Allowing extensions to find out the OIDs of their member objects
Previous Message Amit Khandekar 2019-01-22 10:31:58 Re: Pluggable Storage - Andres's take