could not receive data from WAL stream: SSL SYSCALL error: Success

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: could not receive data from WAL stream: SSL SYSCALL error: Success
Date: 2017-11-15 10:46:43
Message-ID: CAEepm=3cc5wYv=X4Nzy7VOUkdHBiJs9bpLzqtqJWxdDUp5DiPQ@mail.gmail.com
Lists: pgsql-hackers

Hi hackers,

I heard a report of an error like this from a user of OpenSSL
1.1.0f-3+deb9u on Debian:

    pg_basebackup: could not receive data from WAL stream: SSL SYSCALL error: Success

I noticed that some man pages for SSL_get_error say this under
SSL_ERROR_SYSCALL:

Some non-recoverable I/O error occurred. The OpenSSL error queue
may contain more information on the error. For socket I/O on Unix
systems, consult errno for details.

But others say:

Some I/O error occurred. The OpenSSL error queue may contain more
information on the error. If the error queue is empty (i.e. ERR_get_error()
returns 0), ret can be used to find out more about the error: If ret == 0,
an EOF was observed that violates the protocol. If ret == -1, the underlying
BIO reported an I/O error (for socket I/O on Unix systems, consult errno for
details).
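
The decision procedure in that second wording can be sketched as a small C function. This is only an illustration of the quoted text, not OpenSSL or libpq code: the enum, its members and classify_ssl_syscall() are hypothetical names, queued_error stands for the value returned by ERR_get_error(), and ret stands for the return value of the failed SSL I/O call.

```c
/* Hypothetical names; this only models the wording of the older man page. */
typedef enum
{
    SYSCALL_SEE_ERROR_QUEUE,    /* error queue has more information */
    SYSCALL_EOF,                /* EOF observed that violates the protocol */
    SYSCALL_SEE_ERRNO,          /* underlying BIO reported an I/O error */
    SYSCALL_UNSPECIFIED         /* combination the quoted text does not cover */
} syscall_error_kind;

/*
 * queued_error: assumed to be the result of ERR_get_error()
 * ret:          assumed to be the return value of the SSL I/O call
 */
syscall_error_kind
classify_ssl_syscall(unsigned long queued_error, int ret)
{
    if (queued_error != 0)
        return SYSCALL_SEE_ERROR_QUEUE;
    if (ret == 0)
        return SYSCALL_EOF;
    if (ret == -1)
        return SYSCALL_SEE_ERRNO;
    return SYSCALL_UNSPECIFIED;
}
```

Under this reading, only the (empty queue, ret == -1) case says anything about errno; the other cases give no reason to expect errno to be meaningful.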

While wondering if it was the documentation or the behaviour that
changed and what it all means, I came across some discussion and a
reverted commit here:

https://github.com/openssl/openssl/issues/1903

The error reported to me seems to have occurred on a release whose man
page *doesn't* describe the ERR_get_error() == 0 case (unlike some of
the earlier tags you can get to from here):

https://github.com/openssl/openssl/blob/OpenSSL_1_1_0-stable/doc/ssl/SSL_get_error.pod

And yet clearly errno didn't hold an error number from a failed
syscall, which seems consistent with the older documented behaviour.

Perhaps pgtls_read(), pgtls_write() and open_client_SSL() should add
"&& ecode != 0" to the if statements in their SSL_ERROR_SYSCALL case
so that this case would fall to the "EOF detected" message instead of
logging the nonsensical (and potentially uninitialised?) errno
message, if indeed this is behaviour described in older releases. On
the other hand, without documentation to support it in the current
release, we don't really *know* that it's an EOF condition. Due to
this murkiness and the fact that it's mostly harmless anyway, I'm not
proposing a change, but I thought I'd share this in case it makes more
sense to someone more familiar with this stuff.
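
For illustration only, here is a rough standalone sketch of the kind of guard described above. It is not the libpq source: describe_syscall_error() is a hypothetical helper, and the extra condition is modelled as a check on a saved errno value (saved_errno), which is an assumption about the intent, since an errno of zero is what makes strerror() produce "Success".

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not libpq code.
 *
 * n:           return value of the failed SSL read/write call
 * saved_errno: errno captured right after the call (assumed stand-in for
 *              the extra "!= 0" condition discussed above)
 */
const char *
describe_syscall_error(int n, int saved_errno, char *buf, size_t buflen)
{
    if (n < 0 && saved_errno != 0)
    {
        /* A real error number is available, so report it. */
        snprintf(buf, buflen, "SSL SYSCALL error: %s", strerror(saved_errno));
    }
    else
    {
        /* No usable error number: report EOF rather than "Success". */
        snprintf(buf, buflen, "SSL SYSCALL error: EOF detected");
    }
    return buf;
}
```

With this guard, a call such as describe_syscall_error(-1, 0, buf, sizeof(buf)) reports "EOF detected" rather than formatting an error number of zero. Whether that condition really is an EOF in current OpenSSL releases is exactly the open question.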

--
Thomas Munro
http://www.enterprisedb.com
