Re: Fix for OpenSSL error queue bug

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, David Zuelke <dz(at)heroku(dot)com>
Subject: Re: Fix for OpenSSL error queue bug
Date: 2016-03-11 04:16:38
Message-ID: CAM3SWZQWH=FLdZ-8Qtsc9ja-deR132_=JEoO7M=4--nZ2PAFng@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 10, 2016 at 7:22 PM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> Arguably, if everyone followed "my" approach, this should be very easy
> to fix everywhere.

I don't think that there is any clear indication that the OpenSSL
people would share that view. Or my view. Or anything that's sensible
or practical or actionable.

> Instead, reading through the PHP bug report, they
> are coming up with a fairly complex solution for clearing the error
> queue afterwards so as to not leave "landmines" for the next guy. But
> their code will (AFAICT) still be wrong because they are not clearing
> the error *before* the API calls where it is required per documentation.
> So "everyone" (sample of 2) is scrambling to clean up for the next guy
> instead of doing the straightforward fix of following the API
> documentation and cleaning up before their own calls.

It will be less wrong, though.

PostgreSQL is the project that doesn't trust a C90 standard library
function to not blithely write passed the bounds of a passed buffer,
all because of a bug in certain versions of Solaris based systems that
was several years old at the time. See commit be8b06c364. My view that
that wasn't really worth worrying about was clearly the minority view
when this was discussed (a minority of 1, and a majority of 4 or 5,
IIRC). I think that this case vastly exceeds that standard for
worrying about other people's broken code; in this case, we ourselves
made the same mistake for years and years.

> I also see the clean-up-afterwards approach in the Python ssl module. I
> fear there is de facto a second API specification that requires you to
> clean up errors after yourself and gives an implicit guarantee that the
> error queue is empty whenever you want to make any SSL calls. I don't
> think this actually works in all cases, but maybe if everyone else is
> convinced of that (in plain violation of the published OpenSSL
> documentation, AFAICT) we need to get on board with that for
> interoperability.

I didn't know that Python's ssl module did that. That seems to lend
support to my view, which is that we should similarly clear the
thread's queue lest anyone else be affected. Yes, this approach is
fairly scatter-gun, but frankly that's just the situation we find
ourselves in.

>>> Also, there is nothing that
>>> says that an error produces exactly one entry in the error queue; it
>>> could be multiple. Or that errors couldn't arise at random times
>>> between the reset and whatever happens next.
>>
>> I think that it's kind of implied, since calling ERR_get_error() pops
>> the stack. But even if that isn't so, it might be worth preventing bad
>> things from happening to client applications only sometimes.
>
> The lore on the internet suggests that multiple errors could definitely
> happen. So popping one error afterwards isn't going to fix it, it just
> moves the edge case around.

Are you sure, specifically, that an I/O function is known to add more
than one error to the per-thread queue? Obviously there can be more
than one error in the queue. But I haven't seen any specific
indication, either in the docs or in the lore, that more than one
error can be added by a single call to an I/O function such as
SSL_read(). Perhaps you can share where you encountered the lore.

>> doesn't it give you pause? Heikki seemed to think that clearing our
>> own queue was important when he looked at this a year ago:
>>
>> http://www.postgresql.org/message-id/54EDD30D.5050107@vmware.com
>
> I think that message suggests that we should clear the queue before each
> call, not after.

Uh, it very clearly *is* Heikki's view that we should clear the queue
*afterwards*. Certainly, I think Heikki also wanted us to clear the
queue before, so we aren't screwed, "just to be sure", as he puts it
-- but nobody disputes that that's necessary anyway. That it might not
be *sufficient* to just call ERR_get_error() is the new information in
the bug report. Heikki said:

"""
<pedantic>
The OpenSSL manual doesn't directly require you to call
ERR_clear_error() before every SSL_* call. It just requires that you
ensure that the error queue is empty. Libpq ensures that by always
clearing the queue *after* an error happens, in SSLerrmessage().
</pedantic>

"""

The problem with this, as Heikki goes on to say, it that we might not
get to that point in SSLerrmessage(); we may not be able to pop the
queue/call ERR_get_error(), more or less by accident (e.g. I noticed
an OOM could do that). That's why I proposed to fix that by calling
ERR_get_error() early and unambiguously. If we must rely on that
happening, it should not be from such a long distance (i.e. from
within SSLerrmessage(), which is kind of far removed from the original
I/O function calls).

Thanks
--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2016-03-11 04:18:55 Re: WIP: Upper planner pathification
Previous Message Robert Haas 2016-03-11 04:16:26 Re: [PROPOSAL] VACUUM Progress Checker.