Re: Rare SSL failures on eelpout

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rare SSL failures on eelpout
Date: 2019-03-18 20:11:30
Message-ID: 11819.1552939890@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> ... I don't like pqHandleSendFailure all that much: it has strong
> constraints on what state libpq has to be in, as a consequence of which
> it's called from a bunch of ad-hoc places, and can't be called from
> some others. It's kind of accidental that it will work here.
> I was toying with the idea of getting rid of that function in
> favor of a design like this:
> * Handle any socket-write failure at some low level of libpq by
> recording the fact that the error occurred (plus whatever data we
> need to produce a message about it), and then starting to just
> discard output data.
> * Eventually, we'll try to read. Process any available input data
> as usual (and, if we get a read error, report that). When no more
> input data is available, if a socket write failure has been recorded,
> report that much as if it were an incoming ERROR message, and then
> shut down the connection.
> This would essentially give us pqHandleSendFailure-like behavior
> across the board, which might be enough to fix this problem as well as
> bug #15598. Or not ... as you say, we haven't thoroughly understood
> either issue, so it's possible this wouldn't do it.

Here's a draft patch along that line. It's gratifyingly small, and
it does fix the SSL problem for me. I have no way of investigating
whether it fixes bug #15598, but it might.

Aside from the SSL cert-verify issue, I've checked the behavior when
the backend is shut down ("pg_ctl stop") between queries, and when
the backend is kill 9'd mid-query. The shutdown case reacts well with
or without SSL:

regression=# select 2+2;
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

The backend-crash case is fine without SSL:

regression=# select * from tenk1 t1, tenk1 t2, tenk1 t3;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

but a bit less nice with it:

regression=# select * from tenk1 t1, tenk1 t2, tenk1 t3;
SSL SYSCALL error: EOF detected
The connection to the server was lost. Attempting reset: Succeeded.

That seems cosmetic --- we just haven't expended the same effort on
what to print for connection EOF with SSL as without it. (Also,
you get that same result without this patch.)

Of course, this hardly counts as exhaustive testing, especially since
I only tried one OpenSSL version.

> One thing that isn't real clear to me is how much timing sensitivity
> there is in "when no more input data is available". Can we assume that
> if we've gotten ECONNRESET or an allied error from a write, then any
> data the far end might've wished to send us is already available to
> read? In principle, since TCP allows shutting down either transmission
> direction independently, the server could close the read side of its
> socket while continuing to transmit data. In practice, PG servers
> don't do that; but could network timing glitches create the same effect?
> Even if it's possible, does it happen enough to worry about?
> The reason I'm concerned is that I don't think it'd be bright to ignore a
> send error until we see input EOF, which'd be the obvious way to solve a
> timing problem if there is one. If our send error was some transient
> glitch and the connection is really still open, then we might not get EOF,
> but we won't get a server reply either because no message went to the
> server. You could imagine waiting some small interval of time before
> deciding that it's time to report the write failure, but ugh.

As written, this patch does seem vulnerable to this concern, if it's real.
We could address it for post-connection failures by tweaking PQgetResult
so that it doesn't call pqWait if there's already a write error recorded;
but I'm far from sure that that would be a net improvement. Also, making
a comparable change to the behavior of the PQconnectPoll state machine
might be tricky.

My current feeling is that this is OK to put in HEAD but I think the
risk-reward ratio isn't very good for the back branches. Even with
an OpenSSL version where this makes a difference, the problematic
behavior is pretty hard to hit. So I'm a bit inclined to do nothing
in the back branches.

regards, tom lane

Attachment Content-Type Size
libpq-write-error-handling-change-1.patch text/x-diff 12.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joe Conway 2019-03-18 20:13:56 Re: Row Level Security − leakproof-ness and performance implications
Previous Message Robert Haas 2019-03-18 20:02:58 Re: Online verification of checksums