SSL renegotiation and other related woes

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: SSL renegotiation and other related woes
Date: 2015-01-26 10:14:05
Message-ID: 20150126101405.GA31719@awork2.anarazel.de
Lists: pgsql-hackers

Hi,

When working on getting rid of ImmediateInterruptOK I wanted to verify
that ssl still works correctly. Turned out it didn't. But neither did it
in master.

Turns out there's two major things we do wrong:

1) We ignore the rule that once SSL_read()/SSL_write() has been called
and has returned SSL_ERROR_WANT_(READ|WRITE), it has to be called
again with the same parameters. Unfortunately that rule doesn't mean
just that the same parameters have to be passed in, but also that we
can't just constantly switch between SSL_read() and SSL_write().
Especially nonblocking backend code (i.e. walsender) and the whole
frontend code violate this rule.

2) We start renegotiations in be_tls_write() while in nonblocking mode,
but don't properly retry to handle socket readiness. There's a loop
that retries handshakes twenty times (???), but what is actually
needed is to call SSL_get_error() and ensure that there's actually
data available.
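
A minimal sketch of what 2) asks for, under stated assumptions: instead
of retrying a fixed number of times, each retry waits for the socket
condition the previous attempt reported. Everything named here is
hypothetical and not taken from the PostgreSQL tree: step_fn stands in
for one nonblocking SSL_do_handshake() attempt whose result has been
classified via SSL_get_error(), and scripted_step only replays canned
results so the loop can be exercised without OpenSSL.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical result codes; real code would derive them from SSL_get_error(). */
typedef enum { STEP_DONE, STEP_WANT_READ, STEP_WANT_WRITE, STEP_FAILED } step_result;

/* Hypothetical callback performing one nonblocking handshake attempt. */
typedef step_result (*step_fn)(void *ctx);

/* Wait until fd is ready for 'events'; 0 on success, -1 on error or timeout. */
static int
wait_for_socket(int fd, short events, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = events, .revents = 0 };
    int rc;

    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    return rc > 0 ? 0 : -1;
}

/* Retry the step until it completes, waiting for whatever it asked for. */
static int
drive_handshake(int fd, step_fn step, void *ctx, int timeout_ms)
{
    assert(step != NULL);

    for (;;)
    {
        switch (step(ctx))
        {
            case STEP_DONE:
                return 0;
            case STEP_WANT_READ:
                if (wait_for_socket(fd, POLLIN, timeout_ms) < 0)
                    return -1;
                break;
            case STEP_WANT_WRITE:
                if (wait_for_socket(fd, POLLOUT, timeout_ms) < 0)
                    return -1;
                break;
            default:
                return -1;
        }
    }
}

/* Stand-in for the real handshake call: replays a fixed list of results. */
struct script { const step_result *steps; int pos; };

static step_result
scripted_step(void *ctx)
{
    struct script *s = ctx;

    return s->steps[s->pos++];
}

/* Returns the number of attempts made, or -1 if the loop gave up. */
int
demo_handshake(int peer_sends_data)
{
    static const step_result steps[] = { STEP_WANT_WRITE, STEP_WANT_READ, STEP_DONE };
    struct script s = { steps, 0 };
    int sv[2];
    int rc = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    /* Only when the peer has sent something can the WANT_READ wait succeed. */
    if (!peer_sends_data || write(sv[1], "x", 1) == 1)
        rc = drive_handshake(sv[0], scripted_step, &s, 100);
    close(sv[0]);
    close(sv[1]);
    return rc == 0 ? s.pos : -1;
}
```

With data pending from the peer, demo_handshake(1) finishes after three
attempts; with a silent peer, demo_handshake(0) gives up when the
WANT_READ wait times out rather than spinning through blind retries.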

2) is easy enough to fix, but 1) is pretty hard. Before anybody says
that 1) isn't an important rule: It reliably causes connection aborts
within a couple renegotiations. The higher the latency the higher the
likelihood of aborts. Even locally it doesn't take very long to
abort. Errors usually are something like "SSL connection has been closed
unexpectedly" or "SSL Error: sslv3 alert unexpected message" and a host
of other similar messages. There's a couple reports of those in the
archives and I've seen many more in client logs.

As far as I can see the only realistic way to fix 1) is to change both
frontend and backend code to:
a) Always check for socket read/writeability before calling
SSL_read/write() when in nonblocking mode. That's a bit annoying
because it nearly doubles the number of syscalls we do for client
communication, but I can't really see an alternative. That allows us
to avoid waiting inside after a WANT_READ/WRITE, or having to set up a
larger state machine that keeps track of what we tried last.

b) When SSL_read/write nonetheless returns WANT_READ/WRITE, even though
we tested for read/writeability, we're very likely doing
renegotiation. In that case we'll just have to block. There's already
code that busy loops (and thus waits) in the frontend
(c.f. pgtls_read's WANT_WRITE case, triggered during reneg). We can't
just return immediately to the upper layers as we'd otherwise likely
violate the rule about calling ssl with the same parameters again.

c) Add a somewhat hacky optimization whereby we allow breaking out of a
WANT_READ condition in a nonblocking socket when ssl->state ==
SSL_ST_OK. That's the case where it is actually, at least by my
reading of the unreadable ssl code, safe to not wait. That case is
somewhat important because we otherwise can end up waiting on both
sides due to b), even when nonblocking calls were actually made. That
condition essentially means that we'll only block if renegotiation or
partial reads are in progress. Afaics at least.

d) Remove the SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER hack - we don't
actually need it anymore.
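
Steps a) and b) can be sketched roughly as follows. This is only an
illustration under stated assumptions, not the preliminary patch: io_fn
is a hypothetical wrapper around a single SSL_read() or SSL_write()
call whose outcome has been classified via SSL_get_error(), and
fake_read is a stand-in that reports WANT_WRITE once (as could happen
during renegotiation) before doing a plain read(). The ssl->state
shortcut from c) is left out.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical status codes; real code would map SSL_get_error() onto them. */
typedef enum { IO_OK, IO_WANT_READ, IO_WANT_WRITE, IO_ERROR } io_status;

/* Hypothetical wrapper type around one SSL_read() or SSL_write() call. */
typedef io_status (*io_fn)(void *ctx, void *buf, size_t len, size_t *done);

/* timeout_ms 0 only tests, -1 blocks. Returns 1 if ready, 0 if not, -1 on error. */
static int
poll_socket(int fd, short events, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = events, .revents = 0 };
    int rc;

    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    return rc < 0 ? -1 : (rc > 0);
}

/* Returns 1 when the operation completed, 0 when it would block, -1 on error. */
static int
guarded_io(int fd, short events, io_fn op, void *ctx,
           void *buf, size_t len, size_t *done)
{
    int ready;

    assert(op != NULL);

    /* a) test readiness first; if not ready, no call was started at all */
    ready = poll_socket(fd, events, 0);
    if (ready <= 0)
        return ready;

    for (;;)
    {
        /* b) once started, block and repeat the same call, same arguments */
        switch (op(ctx, buf, len, done))
        {
            case IO_OK:
                return 1;
            case IO_WANT_READ:
                if (poll_socket(fd, POLLIN, -1) < 0)
                    return -1;
                break;
            case IO_WANT_WRITE:
                if (poll_socket(fd, POLLOUT, -1) < 0)
                    return -1;
                break;
            default:
                return -1;
        }
    }
}

/* Stand-in connection: reports WANT_WRITE 'wants_left' times, then reads. */
struct fake_conn { int fd; int wants_left; int calls; };

static io_status
fake_read(void *ctx, void *buf, size_t len, size_t *done)
{
    struct fake_conn *c = ctx;
    ssize_t n;

    c->calls++;
    if (c->wants_left > 0)
    {
        c->wants_left--;
        return IO_WANT_WRITE;
    }
    n = read(c->fd, buf, len);
    if (n < 0)
        return IO_ERROR;
    *done = (size_t) n;
    return IO_OK;
}

/* Returns how often the read wrapper was invoked, or -1 on setup failure. */
int
demo_guarded_read(int peer_sends_data)
{
    struct fake_conn c = { -1, 1, 0 };
    char buf[8];
    size_t done = 0;
    int sv[2];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    c.fd = sv[0];
    if (!peer_sends_data || write(sv[1], "x", 1) == 1)
        guarded_io(sv[0], POLLIN, fake_read, &c, buf, sizeof buf, &done);
    close(sv[0]);
    close(sv[1]);
    return c.calls;
}
```

When nothing is readable, demo_guarded_read(0) never invokes the
wrapper, so there is no pending call that would have to be repeated
later. When a byte is pending, demo_guarded_read(1) invokes it twice:
once yielding WANT_WRITE, and once more with the same buffer and length
after blocking for writeability.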

These errors are much less frequent when using a plain frontend
(e.g. psql/pgbench) because they don't use copy both stuff - the way
these clients use the FE/BE protocol there's essentially natural
synchronization points where nothing but renegotiation happens. With
walsender (or pipelined queries!) both sides can write at the same time.

My testcase for this is just to set up a server with a low
ssl_renegotiation_limit, generate lots of WAL (wal.sql attached) and
receive data via pg_receivexlog -n. Usually it'll error out quickly.

I've done a preliminary implementation of the above steps and it
survives transferring 25GB of WAL via the replication protocol with a
ssl_renegotiation_limit=100kB - previously it failed much earlier.

Does anybody have a neater way to tackle this? I'm not happy about this
solution, but I really can't think of anything better (save ditching
openssl maybe). I'm willing to clean up my hacked up fix for this, but
not if we can't find agreement on the approach.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
