Re: [PATCH] Avoid mixing custom and OpenSSL BIO functions

From: David Benjamin <davidben(at)google(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [PATCH] Avoid mixing custom and OpenSSL BIO functions
Date: 2024-04-19 13:13:45
Message-ID: CAF8qwaCCrM-Zj6x1CO2FsLkj4v6oppWvB+BQngEQo1f5T-J=HA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Circling back here, anything else needed from my end on this patch?

On Wed, Feb 21, 2024, 17:04 David Benjamin <davidben(at)google(dot)com> wrote:

> Thanks for the very thorough comments! I've attached a new version of the
> patch.
>
> On Thu, Feb 15, 2024 at 4:17 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
>> Hi,
>>
>> On 2024-02-11 13:19:00 -0500, David Benjamin wrote:
>> > I've attached a patch for the master branch to fix up the custom BIOs
>> used
>> > by PostgreSQL, in light of the issues with the OpenSSL update recently.
>> > While c82207a548db47623a2bfa2447babdaa630302b9 (switching from
>> BIO_get_data
>> > to BIO_get_app_data) resolved the immediate conflict, I don't think it
>> > addressed the root cause, which is that PostgreSQL was mixing up two BIO
>> > implementations, each of which expected to be driving the BIO.
>>
>> Yea, that's certainly not nice - and I think we've been somewhat lucky it
>> hasn't caused more issues. There's some nasty possibilities, e.g.
>> sock_ctrl()
>> partially enabling ktls without our initialization having called
>> ktls_enable(). Right now that just means ktls is unusable, but it's not
>> hard
>> to imagine accidentally ending up sending unencrypted data.
>>
>
> Yeah. Even if, say, the ktls bits work, given you all care enough about
> I/O to have wanted to wrap the BIO, I assume you'd want to pick up those
> features on your own terms, e.g. by implementing the BIO_CTRLs yourself.
>
>
>> I've in the past looked into not using a custom BIO [1], but I have my
>> doubts
>> about that being a good idea. I think medium term we want to be able to do
>> network IO asynchronously, which seems quite hard to do when using
>> openssl's
>> socket BIO.
>
>
>
> > Once we've done that, we're free to use BIO_set_data. While
>> BIO_set_app_data
>> > works fine, I've reverted back to BIO_set_data because it's more
>> commonly used.
>> > app_data depends on OpenSSL's "ex_data" mechanism, which is a tad
>> heavier under
>> > the hood.
>>
>> At first I was a bit wary of that, because it requires us to bring back
>> the
>> fallback implementation. But you're right, it's noticeably heavier than
>> BIO_get_data(), and we do call BIO_get_app_data() fairly frequently.
>>
>
> TBH, I doubt it makes any real difference perf-wise. But I think
> BIO_get_data is a bit more common in this context.
>
>
>> > That leaves ctrl. ctrl is a bunch of operations (it's ioctl). The only
>> > operation that matters is BIO_CTRL_FLUSH, which is implemented as a
>> no-op. All
>> > other operations are unused. It's once again good that they're unused
>> because
>> > otherwise OpenSSL might mess with postgres's socket, break the
>> assumptions
>> > around interrupt handling, etc.
>>
>> How did you determine that only FLUSH is required? I didn't even really
>> find
>> documentation about what the intended semantics actually are.
>>
>
> The unhelpful answer is that my day job is working on BoringSSL, so I've
> spent a lot of time with this mess. :-) But, yeah, it's not well-documented
> at all. OpenSSL ends up calling BIO_flush at the end of each batch of
> writes in libssl. TBH, I suspect that was less intentional and more an
> emergent property of them internally layering a buffer BIO at one point in
> the process, but it's long been part of the (undocumented) API contract.
> Conversely, I don't think OpenSSL can possibly make libssl *require* a
> new BIO_CTRL because they'd break every custom BIO anyone has ever used
> with the library.
>
>
>> E.g. should we implement BIO_CTRL_EOF? Sure, it wasn't really supported so
>> far, because we never set it, but is that right?
>
>
> Ah hmm, BIO_CTRL_EOF is... a bit of a mess. OpenSSL kind of messed things
> up. So, up until recently, I would have said that BIO_CTRL_EOF was not part
> of the interface here. OpenSSL 1.0.x did not implement it for sockets, and
> the BIO_read API *already* had a way to signal EOF: did you return zero
> or -1?
>
> Then OpenSSL 1.1.x introduced size_t-based BIO_read_ex APIs. However, in
> the process, they *forgot that EOF and error are different things* and
> made it impossible to detect EOFs if you use BIO_read_ex! They never
> noticed this, because they didn't actually convert their own code to their
> new API. See this discussion, which alas ended with OpenSSL deciding to
> ignore the problem and not even document their current interface.
> https://github.com/openssl/openssl/issues/8208
>
> Though they never responded, they seem to have tacitly settled using the
> out-of-band BIO_eof function (which is BIO_CTRL_EOF) as the way to signal
> EOF for BIO_read_ex. This is kind of fiddly, but is at least a well-defined
> option. But the problem is no one's BIO_METHODs, including their own, are
> read_ex-based. They all implement the old read callback. But someone might
> call BIO_read_ex on a read-based BIO_METHOD. IMO, BIO_read_ex should be
> responsible for translating between the two EOF conventions. For example,
> it could automatically set a flag when the read callback returns 0 and then
> make BIO_ctrl check the flag and automatically implement BIO_CTRL_EOF for
> BIOs that don't do it themselves. Alas, OpenSSL did not do this and instead
> they just made their built-in BIOs implement BIO_CTRL_EOF, even though this
> new expectation is a compatibility break. That leaves BIO_read, the one
> everyone uses, in a pretty ambiguous state that they seem uninterested in
> addressing.
> https://github.com/openssl/openssl/pull/10882
> https://github.com/openssl/openssl/pull/10907
>
> Honestly, I suspect nothing in postgres actually cares about this, given
> you all didn't notice things breaking around here in the early days of
> 1.1.x. Still, that is a good point. I've updated the patch to implement
> BIO_CTRL_EOF for completeness' sake.
>
> (For BoringSSL, we did not go down this route because, unlike OpenSSL
> apparently, we do not think breaking backwards compatibility in the EOF
> convention is OK. I do want to add the size_t APIs eventually, but I'm
> thinking we'll do the automatic BIO_CTRL_EOF thing described above, to
> avoid breaking compatibility with existing BIO_METHODs.)
>
> Beyond BIO_CTRL_EOF, I assume what you're really asking here is "how do we
> know OpenSSL won't change the interface again?". And, honestly, we don't.
> They promise API and ABI stability, which in theory means they shouldn't.
> But their track record with custom BIOs is not great. That said, if they do
> break it, I think it will unambiguously be an API break on their part and
> hopefully it'll be possible to get them to fix the issue. The EOF issue
> snuck in because it mostly only impacted people who tried to migrate to
> their new APIs. It broke existing APIs too, but the one project that
> noticed (CPython) misdiagnosed the issue and worked around it.
> (Incorrectly, but no one noticed. See
> https://github.com/python/cpython/pull/95495.) So, I raised the issue,
> they'd long since shipped it and evidently felt no burning need to fix
> their regression or unclear APIs. :-(
>
> But the alternative, picking up the real socket BIO's ctrl function, is
> even more unsound, so I think this is the better place to be.
>
>
>> What about
>> BIO_CTRL_GET_CLOSE/BIO_CTRL_SET_CLOSE?
>>
>
> The close flag is pretty silly. All BIOs do slightly different things with
> it, which means that libssl cannot possibly call it generically. So if your
> BIO doesn't have any need for it, you don't need to bother. It's usually
> used to indicate whether the BIO owns the underlying fd/socket/BUF_MEM/etc,
>
>
>> Another issue is that 0 doesn't actually seem like the universal error
>> return
>> - e.g. BIO_C_GET_FD seems to return -1, because 0 is a valid fd.
>>
>
> Yeah, that is... also a mess. All of OpenSSL's ctrl functions return zero
> for unrecognized commands. It seems they just don't have a convention for
> distinguishing unimplemented commands from zero returns. As you note, this
> is a problem for BIO_C_GET_FD. OpenSSL's other non-fd BIOs have the same
> problem though:
> https://github.com/openssl/openssl/blob/master/crypto/bio/bss_file.c#L340
>
> Realistically, type-specific ioctls like that don't matter much because
> you're really only going to call them if you already know you have an
> applicable BIO. Hopefully, between that, and removing BIO_TYPE_DESCRIPTOR
> (below), it's mostly okay? Also happy to add a `ret = -1` implementation
> of BIO_C_GET_FD if you'd prefer. But, in the general case, I suspect we
> just have to live with the zero thing, and hopefully OpenSSL will stop
> introducing ctrls where zero is an unsafe return value.
>
> Perhaps BIO_ctrl in OpenSSL should have some logic like...
> ```
> if (!(type & BIO_TYPE_DESCRIPTOR) && cmd == BIO_C_GET_FD) {
> return -1;
> }
> ```
> Although I suspect there are people who implement BIO_C_GET_FD without
> setting BIO_TYPE_DESCRIPTOR because none of this was ever documented.
>
>
>> As of your patch the bio doesn't actually have an FD anymore, should it
>> still
>> set BIO_TYPE_DESCRIPTOR?
>>
>
> Ah good call. Done. I don't think much code really cares, but it does mean
> that, if you all call SSL_get_fd (why would you?), it won't get confused.
> :-)
>
> > +static long
>> > +my_sock_ctrl(BIO *h, int cmd, long num, void *ptr)
>> > +{
>> > + long res = 0;
>> > +
>> > + switch (cmd)
>> > + {
>> > + case BIO_CTRL_FLUSH:
>> > + /* libssl expects all BIOs to support BIO_flush.
>> */
>> > + res = 1;
>> > + break;
>> > + }
>> > +
>> > + return res;
>> > +}
>>
>> I'd move the res = 0 into a default: block. That way the compiler can
>> warn if
>> some case doesn't set it in all branches.
>>
>
> Done.
>
>
>> > static BIO_METHOD *
>> > my_BIO_s_socket(void)
>> > {
>>
>> Wonder if we should rename this. It's pretty odd that we still call it's
>> not
>> really related to s_socket anymore, and doesn't even really implement the
>> same
>> interface (e.g. get_fd doesn't work anymore). Similarly, my_SSL_set_fd()
>> doesn't actually call set_fd() anymore, which sure seems odd.
>>
>
> Done. I wasn't sure what to name them, or even what the naming convention
> was, but I opted to name them after the Port and PGconn objects they wrap.
> Happy to switch to another name if you have a better idea though.
>
> David
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2024-04-19 13:17:10 Re: [PATCH] Avoid mixing custom and OpenSSL BIO functions
Previous Message Daniel Gustafsson 2024-04-19 12:59:16 Re: Okay to remove mention of mystery @ and ~ operators?