Re: Rare SSL failures on eelpout

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rare SSL failures on eelpout
Date: 2019-03-04 22:35:44
Message-ID: 30952.1551738944@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> On Tue, Mar 5, 2019 at 10:08 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> It's all very confusing, but I think there's a nontrivial chance
>> that this is an OpenSSL bug, especially since we haven't been able
>> to replicate it elsewhere.

> Hmm. Yes, it is strange that we haven't seen it elsewhere, but
> remember that very few animals are running the ssl tests; also it
> requires particular timing to hit.

True. I've spent some time today running the ssl tests on various
machines here, without any luck reproducing.

> OK, here's something. I can reproduce it quite easily on this
> machine, and I can fix it like this:

> libpq_gettext("could not send startup packet: %s\n"),
> SOCK_STRERROR(SOCK_ERRNO, sebuf, sizeof(sebuf)));
> free(startpacket);
> + pqHandleSendFailure(conn);
> goto error_return;
> }

Yeah. But I don't like pqHandleSendFailure all that much: it has strong
constraints on what state libpq has to be in, as a consequence of which
it's called from a bunch of ad-hoc places, and can't be called from
some others. It's kind of accidental that it will work here.

I was toying with the idea of getting rid of that function in
favor of a design like this:

* Handle any socket-write failure at some low level of libpq by
recording the fact that the error occurred (plus whatever data we
need to produce a message about it), and then starting to just
discard output data.

* Eventually, we'll try to read. Process any available input data
as usual (and, if we get a read error, report that). When no more
input data is available, if a socket write failure has been recorded,
report that much as if it were an incoming ERROR message, and then
shut down the connection.

This would essentially give us pqHandleSendFailure-like behavior
across the board, which might be enough to fix this problem as well as
bug #15598. Or not ... as you say, we haven't thoroughly understood
either issue, so it's possible this wouldn't do it.

This would have the effect of prioritizing reports of socket read
failures over those of write failures, which is an interesting
behavioral change, but offhand I can't see a problem with it.

One thing that isn't real clear to me is how much timing sensitivity
there is in "when no more input data is available". Can we assume that
if we've gotten ECONNRESET or an allied error from a write, then any
data the far end might've wished to send us is already available to
read? In principle, since TCP allows shutting down either transmission
direction independently, the server could close the read side of its
socket while continuing to transmit data. In practice, PG servers
don't do that; but could network timing glitches create the same effect?
Even if it's possible, does it happen enough to worry about?

The reason I'm concerned is that I don't think it'd be bright to ignore a
send error until we see input EOF, which'd be the obvious way to solve a
timing problem if there is one. If our send error was some transient
glitch and the connection is really still open, then we might not get EOF,
but we won't get a server reply either because no message went to the
server. You could imagine waiting some small interval of time before
deciding that it's time to report the write failure, but ugh.

In any case, consuming immediately-available data before reporting the
write error would already be a step forward over what we're doing now,
I think.

Thoughts?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2019-03-04 22:36:09 Re: Remove Deprecated Exclusive Backup Mode
Previous Message Karl O. Pinc 2019-03-04 22:33:47 Patch to document base64 encoding