libpq SSL with non-blocking sockets

From: Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: libpq SSL with non-blocking sockets
Date: 2011-06-09 14:39:33
Message-ID: 4DF0DB25.9050105@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I believe I have found a bug in libpq COPY handling with non-blocking
SSL connections. The bug manifests itself by dropping the connection
in PGputCopyData() with "server closed the connection unexpectedly"
message. The connection drop only occurs with nonblocking connections -
blocking connections work as expected.

I'll skip a lot of debugging details, but what I found out was that
the connection drop was caused by a SSL "bad write retry" from
SSL_write(). Which in turn appears to be caused by bad data passed to
SSL_write() retry. So, what happened was that at some point SSL_write()
returned SSL_ERROR_WANT_WRITE meaning that the write needs to be
retried. Now, according to OpenSSL manual the retry SSL_write needs to
be passed the same arguments as the previous call:
http://www.openssl.org/docs/ssl/SSL_write.html#WARNING

However the next SSL_write was passed another buffer with a different
length and it failed with SSL_ERROR_SSL "bad write retry". Which in
turn caused the pqSendSome to drop the connection and failed the
PQputCopyData. Actually, now that I think of it, this is probably not
only related to COPY handling ...

The connection drop can be reproduced by introducing some "network
problems" to the connection, so that the SSL_write needs to be retried.
This can be most easily accomplished by passing a huge buffer to
PQputCopyData. For instance on my laptop I can do about 3 PQputCopyData
with a 100MB buffer before the connection is dropped. Others results may
vary. Also, it doesn't seem to matter what the libpq version is
used - I initially started on 8.3, but later confirmed that the same
problem exists on 8.4, 9.0 and 9.1. Tested with OpenSSL 0.9.8g and
0.9.8o, no idea about other versions.

The following is a log excerpt from my test program (attached) with some
extra logging added to libpq:

PQputCopyData #0 buf:100000000 total:0
pqFlush()
pqFlush() = 0
pqPutMsgEnd()
pqSendSome(99999744)
SSL_write(0x7e59d60, 0x7ff5a7eac010, 99999744) = 99999744:
SSL_error: 0, errno: 0
pqPutMsgEnd() = 0
PQputCopyData #1 buf:100000000 total:100000000
pqPutMsgEnd()
pqSendSome(99999744)
SSL_write(0x7e59d60, 0x7ff5a7eac010, 99999744) = -1: SSL_error: 3,
errno: 11
pqPutMsgEnd() = 0
PQputCopyData #2 buf:100000000 total:200000000
pqFlush()
pqSendSome(100000266)
SSL_write(0x7e59d60, 0x7ff5a7eac010, 100000266) = -1: SSL_error:
3, errno: 11
pqFlush() = 1
pqPutMsgEnd()
pqSendSome(199999488)
SSL_write(0x7e59d60, 0x7ff597eab010, 199999488) = -1: SSL_error:
1, errno: 0
SSL error: bad write retry
pqPutMsgEnd() = -1
PQputCopyData() = -1: server closed the connection unexpectedly

In PQputCopyData #2 it is visible that the first SSL_write called from
pqFlush failed with SSL_ERROR_WANT_WRITE. The next SSL_write should
have been a retry with the same parameters, but instead was passed a
buffer with a different address and length. Hence the "bad write
retry". Some googling turned out similar issues for other projects
using SSL with non-blocking sockets.

The possible workarounds are to disable SSL or to disable non-blocking
libpq connections. Both are not always possible - security reasons, 3rd
party applications, drivers, etc. So I think this should be fixed in
libpq. Not sure exactly how though. It would seem that for the
PQputCopyData the best would be to return 0 to indicate that the
operation should be retried. No idea for the other possible cases of
SSL_write() retry though. What do you think?

regards,
Martin

Attachment Content-Type Size
copy_test.c text/x-csrc 2.2 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2011-06-09 14:40:18 Re: .gitignore for some of cygwin files
Previous Message Tom Lane 2011-06-09 14:38:59 Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS