Re: SSL renegotiation and other related woes

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SSL renegotiation and other related woes
Date: 2015-02-05 21:03:02
Message-ID: 54D3DA86.6000303@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 01/26/2015 12:14 PM, Andres Freund wrote:
> Hi,
>
> When working on getting rid of ImmediateInterruptOK I wanted to verify
> that ssl still works correctly. Turned out it didn't. But neither did it
> in master.
>
> Turns out there's two major things we do wrong:
>
> 1) We ignore the rule that once called and returning
> SSL_ERROR_WANTS_(READ|WRITE) SSL_read()/write() have to be called
> again with the same parameters. Unfortunately that rule doesn't mean
> just that the same parameters have to be passed in, but also that we
> can't just constantly switch between _read()/write(). Especially
> nonblocking backend code (i.e. walsender) and the whole frontend code
> violate this rule.

I don't buy that. Googling around, and looking at the man pages, I just
don't see anything to support that claim. I believe it is supported to
switch between reads and writes.

> 2) We start renegotiations in be_tls_write() while in nonblocking mode,
> but don't properly retry to handle socket readyness. There's a loop
> that retries handshakes twenty times (???), but what actually is
> needed is to call SSL_get_error() and ensure that there's actually
> data available.

Yeah, that's just crazy.

Actually, I don't think we need to call SSL_do_handshake() at all. At
least on modern OpenSSL versions. OpenSSL internally uses a state
machine, and SSL_read() and SSL_write() calls will complete the
handshake just as well.

> 2) is easy enough to fix, but 1) is pretty hard. Before anybody says
> that 1) isn't an important rule: It reliably causes connection aborts
> within a couple renegotiations. The higher the latency the higher the
> likelihood of aborts. Even locally it doesn't take very long to
> abort. Errors usually are something like "SSL connection has been closed
> unexpectedly" or "SSL Error: sslv3 alert unexpected message" and a host
> of other similar messages. There's a couple reports of those in the
> archives and I've seen many more in client logs.

Yeah, I can reproduce that. There's clearly something wrong.

I believe this is an OpenSSL bug. I traced the "unexpected message"
error to this code in OpenSSL's s3_read_bytes() function:

> case SSL3_RT_APPLICATION_DATA:
> /*
> * At this point, we were expecting handshake data, but have
> * application data. If the library was running inside ssl3_read()
> * (i.e. in_read_app_data is set) and it makes sense to read
> * application data at this point (session renegotiation not yet
> * started), we will indulge it.
> */
> if (s->s3->in_read_app_data &&
> (s->s3->total_renegotiations != 0) &&
> (((s->state & SSL_ST_CONNECT) &&
> (s->state >= SSL3_ST_CW_CLNT_HELLO_A) &&
> (s->state <= SSL3_ST_CR_SRVR_HELLO_A)
> ) || ((s->state & SSL_ST_ACCEPT) &&
> (s->state <= SSL3_ST_SW_HELLO_REQ_A) &&
> (s->state >= SSL3_ST_SR_CLNT_HELLO_A)
> )
> )) {
> s->s3->in_read_app_data = 2;
> return (-1);
> } else {
> al = SSL_AD_UNEXPECTED_MESSAGE;
> SSLerr(SSL_F_SSL3_READ_BYTES, SSL_R_UNEXPECTED_RECORD);
> goto f_err;
> }

So that quite clearly throws an error, *unless* the original application
call was SSL_read(). As you also concluded, OpenSSL doesn't like it when
SSL_write() is called when it's in renegotiation. I think that's
OpenSSL's fault and it should "indulge" the application data message
even in SSL_write().

Can we work-around that easily? I believe we can. The crucial part is
that we mustn't let SSL_write() to process any incoming application
data. We can achieve that if we always call SSL_read() to drain such
data, before calling SSL_write(). But that's not quite enough; if we're
in renegotiation, SSL_write() might still try to read more data from the
socket that has arrived after the SSL_read() call. Fortunately, we can
easily prevent that by hacking pqsecure_raw_read() to always return
EWOULDBLOCK, when it's called through SSL_write().

The attached patch does that. I've been running your pg_receivexlog test
case with this for about 15 minutes without any errors now, with
ssl_renegotiation_limit=50kB, while before it errored out within seconds.

In theory, I guess we should do similar hacks in the server, and always
call SSL_read() before SSL_write(), but it seems to work without it. Or
maybe not; OpenSSL server and client code is not symmetric, so perhaps
it works in server mode without that.

PS. The server code started renegotiation when it's 1kB from reaching
ssl_renegotiation_limit. I increased that to 32kB, because in testing, I
got some "SSL failed to renegotiate connection before limit expired"
errors in testing before I did that.

PPS. I did all testing of this patch with my sleeping logic
simplification patch applied
(http://www.postgresql.org/message-id/54D3821E.9060004@vmware.com). It
applies without it too, and I don't think it matters, but I thought I'd
mention it.

- Heikki

Attachment Content-Type Size
0001-SSL-renegotiation-fixes.patch application/x-patch 6.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2015-02-05 21:28:15 Re: Redesigning checkpoint_segments
Previous Message Michael Paquier 2015-02-05 19:58:08 Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code