Re: Rare SSL failures on eelpout

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rare SSL failures on eelpout
Date: 2019-03-04 21:59:35
Message-ID: CA+hUKGJ55XHi0ptsJQjMU=LK4kDegF9koG=AXEzvaE=wMCRFSw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 5, 2019 at 10:08 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote:
> > Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> >> That suggests that we could perhaps handle ECONNRESET both at startup
> >> packet send time (for certificate rejection, eelpout's case) and at
> >> initial query send (for idle timeout, bug #15598's case) by attempting
> >> to read. Does that make sense?
>
> > Hmm ... it definitely makes sense that we shouldn't assume that a *write*
> > failure means there is nothing left to *read*.
>
> After staring at the code for awhile, I am thinking that there may be
> a bug of that ilk, but if so it's down inside OpenSSL. Perhaps it's
> specific to the OpenSSL version you're using on eelpout? There is
> not anything we could do differently in libpq, AFAICS, because it's
> OpenSSL's responsibility to read any data that might be available.
>
> I also looked into the idea that we're doing something wrong on the
> server side, allowing the final error message to not get flushed out.
> A plausible theory there is that SSL_shutdown is returning a WANT_READ
> or WANT_WRITE error and we should retry it ... but that doesn't square
> with your observation upthread that it's returning SSL_ERROR_SSL.
>
> It's all very confusing, but I think there's a nontrivial chance
> that this is an OpenSSL bug, especially since we haven't been able
> to replicate it elsewhere.

Hmm. Yes, it is strange that we haven't seen it elsewhere, but
remember that very few animals are running the ssl tests; also it
requires particular timing to hit.

OK, here's something. I can reproduce it quite easily on this
machine, and I can fix it like this:

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index f29202db5f..e9c137f1bd 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -2705,6 +2705,7 @@ keep_going: /* We will come back to here until there is
                                   libpq_gettext("could not send startup packet: %s\n"),
                                   SOCK_STRERROR(SOCK_ERRNO, sebuf, sizeof(sebuf)));
                 free(startpacket);
+                pqHandleSendFailure(conn);
                 goto error_return;
             }

If I add some printf debugging in there, I can see that block being
reached about once in every hundred attempts to connect with a revoked
certificate, and with that extra call to pqHandleSendFailure() in
place the error comes out as it should:

psql: SSL error: sslv3 alert certificate revoked
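
To illustrate the general point that a failed *write* does not mean there is nothing left to *read*, here is a minimal sketch using plain blocking sockets. It is not libpq or OpenSSL code: send_or_collect_reply() and SKETCH_SEND_FLAGS are hypothetical names invented for this example, and the sketch assumes the send failure is caused by the peer having closed its end, so the follow-up reads end at EOF instead of blocking.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

#ifdef MSG_NOSIGNAL
#define SKETCH_SEND_FLAGS MSG_NOSIGNAL
#else
#define SKETCH_SEND_FLAGS 0		/* caller must ignore SIGPIPE on such platforms */
#endif

/*
 * Hypothetical helper, not libpq code: try to send the NUL-terminated
 * string "msg" on the blocking socket "fd".  Returns 0 if all of it was
 * sent.  If the send fails, read whatever the peer managed to send before
 * it went away into "reply" (always NUL-terminated) and return -1, so the
 * caller can report the peer's message and not only the local send error.
 */
static int
send_or_collect_reply(int fd, const char *msg, char *reply, size_t replysize)
{
	size_t		len = strlen(msg);
	size_t		sent = 0;
	size_t		got = 0;
	ssize_t		n;

	assert(reply != NULL && replysize > 0);
	reply[0] = '\0';

	while (sent < len)
	{
		n = send(fd, msg + sent, len - sent, SKETCH_SEND_FLAGS);
		if (n < 0)
		{
			if (errno == EINTR)
				continue;		/* interrupted, just retry */
			break;				/* e.g. EPIPE or ECONNRESET */
		}
		sent += (size_t) n;
	}
	if (sent == len)
		return 0;

	/* The write failed; that says nothing about what is waiting to be read. */
	while (got < replysize - 1)
	{
		n = recv(fd, reply + got, replysize - 1 - got, 0);
		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0)
			break;				/* EOF or hard error: nothing more to read */
		got += (size_t) n;
	}
	reply[got] = '\0';
	return -1;
}
```

A caller would check for the -1 return and, if reply is non-empty, prefer it over the errno-based message when reporting the failure.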

Now I'm confused because we already have handling like that in
PQsendQuery(), so I can't explain bug #15598.

--
Thomas Munro
https://enterprisedb.com
