Re: [PATCH] Avoid mixing custom and OpenSSL BIO functions

From: David Benjamin <davidben(at)google(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [PATCH] Avoid mixing custom and OpenSSL BIO functions
Date: 2024-02-21 22:04:08
Message-ID: CAF8qwaBHKRNv2mJU2eMRwrM6-FuB_mkUVas47Pv_Aah9vzM3QA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for the very thorough comments! I've attached a new version of the
patch.

On Thu, Feb 15, 2024 at 4:17 PM Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> On 2024-02-11 13:19:00 -0500, David Benjamin wrote:
> > I've attached a patch for the master branch to fix up the custom BIOs
> used
> > by PostgreSQL, in light of the issues with the OpenSSL update recently.
> > While c82207a548db47623a2bfa2447babdaa630302b9 (switching from
> BIO_get_data
> > to BIO_get_app_data) resolved the immediate conflict, I don't think it
> > addressed the root cause, which is that PostgreSQL was mixing up two BIO
> > implementations, each of which expected to be driving the BIO.
>
> Yea, that's certainly not nice - and I think we've been somewhat lucky it
> hasn't caused more issues. There's some nasty possibilities, e.g.
> sock_ctrl()
> partially enabling ktls without our initialization having called
> ktls_enable(). Right now that just means ktls is unusable, but it's not
> hard
> to imagine accidentally ending up sending unencrypted data.
>

Yeah. Even if, say, the ktls bits work, given you all care enough about I/O
to have wanted to wrap the BIO, I assume you'd want to pick up those
features on your own terms, e.g. by implementing the BIO_CTRLs yourself.

> I've in the past looked into not using a custom BIO [1], but I have my
> doubts
> about that being a good idea. I think medium term we want to be able to do
> network IO asynchronously, which seems quite hard to do when using
> openssl's
> socket BIO.

> Once we've done that, we're free to use BIO_set_data. While
> BIO_set_app_data
> > works fine, I've reverted back to BIO_set_data because it's more
> commonly used.
> > app_data depends on OpenSSL's "ex_data" mechanism, which is a tad
> heavier under
> > the hood.
>
> At first I was a bit wary of that, because it requires us to bring back the
> fallback implementation. But you're right, it's noticeably heavier than
> BIO_get_data(), and we do call BIO_get_app_data() fairly frequently.
>

TBH, I doubt it makes any real difference perf-wise. But I think
BIO_get_data is a bit more common in this context.

> > That leaves ctrl. ctrl is a bunch of operations (it's ioctl). The only
> > operation that matters is BIO_CTRL_FLUSH, which is implemented as a
> no-op. All
> > other operations are unused. It's once again good that they're unused
> because
> > otherwise OpenSSL might mess with postgres's socket, break the
> assumptions
> > around interrupt handling, etc.
>
> How did you determine that only FLUSH is required? I didn't even really
> find
> documentation about what the intended semantics actually are.
>

The unhelpful answer is that my day job is working on BoringSSL, so I've
spent a lot of time with this mess. :-) But, yeah, it's not well-documented
at all. OpenSSL ends up calling BIO_flush at the end of each batch of
writes in libssl. TBH, I suspect that was less intentional and more an
emergent property of them internally layering a buffer BIO at one point in
the process, but it's long been part of the (undocumented) API contract.
Conversely, I don't think OpenSSL can possibly make libssl *require* a new
BIO_CTRL because they'd break every custom BIO anyone has ever used with
the library.

> E.g. should we implement BIO_CTRL_EOF? Sure, it wasn't really supported so
> far, because we never set it, but is that right?

Ah hmm, BIO_CTRL_EOF is... a bit of a mess. OpenSSL kind of messed things
up. So, up until recently, I would have said that BIO_CTRL_EOF was not part
of the interface here. OpenSSL 1.0.x did not implement it for sockets, and
the BIO_read API *already* had a way to signal EOF: did you return zero or
-1?

Then OpenSSL 1.1.x introduced size_t-based BIO_read_ex APIs. However, in
the process, they *forgot that EOF and error are different things* and made
it impossible to detect EOFs if you use BIO_read_ex! They never noticed
this, because they didn't actually convert their own code to their new API.
See this discussion, which alas ended with OpenSSL deciding to ignore the
problem and not even document their current interface.
https://github.com/openssl/openssl/issues/8208

Though they never responded, they seem to have tacitly settled using the
out-of-band BIO_eof function (which is BIO_CTRL_EOF) as the way to signal
EOF for BIO_read_ex. This is kind of fiddly, but is at least a well-defined
option. But the problem is no one's BIO_METHODs, including their own, are
read_ex-based. They all implement the old read callback. But someone might
call BIO_read_ex on a read-based BIO_METHOD. IMO, BIO_read_ex should be
responsible for translating between the two EOF conventions. For example,
it could automatically set a flag when the read callback returns 0 and then
make BIO_ctrl check the flag and automatically implement BIO_CTRL_EOF for
BIOs that don't do it themselves. Alas, OpenSSL did not do this and instead
they just made their built-in BIOs implement BIO_CTRL_EOF, even though this
new expectation is a compatibility break. That leaves BIO_read, the one
everyone uses, in a pretty ambiguous state that they seem uninterested in
addressing.
https://github.com/openssl/openssl/pull/10882
https://github.com/openssl/openssl/pull/10907

Honestly, I suspect nothing in postgres actually cares about this, given
you all didn't notice things breaking around here in the early days of
1.1.x. Still, that is a good point. I've updated the patch to implement
BIO_CTRL_EOF for completeness' sake.

(For BoringSSL, we did not go down this route because, unlike OpenSSL
apparently, we do not think breaking backwards compatibility in the EOF
convention is OK. I do want to add the size_t APIs eventually, but I'm
thinking we'll do the automatic BIO_CTRL_EOF thing described above, to
avoid breaking compatibility with existing BIO_METHODs.)

Beyond BIO_CTRL_EOF, I assume what you're really asking here is "how do we
know OpenSSL won't change the interface again?". And, honestly, we don't.
They promise API and ABI stability, which in theory means they shouldn't.
But their track record with custom BIOs is not great. That said, if they do
break it, I think it will unambiguously be an API break on their part and
hopefully it'll be possible to get them to fix the issue. The EOF issue
snuck in because it mostly only impacted people who tried to migrate to
their new APIs. It broke existing APIs too, but the one project that
noticed (CPython) misdiagnosed the issue and worked around it.
(Incorrectly, but no one noticed. See
https://github.com/python/cpython/pull/95495.) So, I raised the issue,
they'd long since shipped it and evidently felt no burning need to fix
their regression or unclear APIs. :-(

But the alternative, picking up the real socket BIO's ctrl function, is
even more unsound, so I think this is the better place to be.

> What about
> BIO_CTRL_GET_CLOSE/BIO_CTRL_SET_CLOSE?
>

The close flag is pretty silly. All BIOs do slightly different things with
it, which means that libssl cannot possibly call it generically. So if your
BIO doesn't have any need for it, you don't need to bother. It's usually
used to indicate whether the BIO owns the underlying fd/socket/BUF_MEM/etc,

> Another issue is that 0 doesn't actually seem like the universal error
> return
> - e.g. BIO_C_GET_FD seems to return -1, because 0 is a valid fd.
>

Yeah, that is... also a mess. All of OpenSSL's ctrl functions return zero
for unrecognized commands. It seems they just don't have a convention for
distinguishing unimplemented commands from zero returns. As you note, this
is a problem for BIO_C_GET_FD. OpenSSL's other non-fd BIOs have the same
problem though:
https://github.com/openssl/openssl/blob/master/crypto/bio/bss_file.c#L340

Realistically, type-specific ioctls like that don't matter much because
you're really only going to call them if you already know you have an
applicable BIO. Hopefully, between that, and removing BIO_TYPE_DESCRIPTOR
(below), it's mostly okay? Also happy to add a `ret = -1` implementation
of BIO_C_GET_FD if you'd prefer. But, in the general case, I suspect we
just have to live with the zero thing, and hopefully OpenSSL will stop
introducing ctrls where zero is an unsafe return value.

Perhaps BIO_ctrl in OpenSSL should have some logic like...
```
if (!(type & BIO_TYPE_DESCRIPTOR) && cmd == BIO_C_GET_FD) {
return -1;
}
```
Although I suspect there are people who implement BIO_C_GET_FD without
setting BIO_TYPE_DESCRIPTOR because none of this was ever documented.

> As of your patch the bio doesn't actually have an FD anymore, should it
> still
> set BIO_TYPE_DESCRIPTOR?
>

Ah good call. Done. I don't think much code really cares, but it does mean
that, if you all call SSL_get_fd (why would you?), it won't get confused.
:-)

> +static long
> > +my_sock_ctrl(BIO *h, int cmd, long num, void *ptr)
> > +{
> > + long res = 0;
> > +
> > + switch (cmd)
> > + {
> > + case BIO_CTRL_FLUSH:
> > + /* libssl expects all BIOs to support BIO_flush. */
> > + res = 1;
> > + break;
> > + }
> > +
> > + return res;
> > +}
>
> I'd move the res = 0 into a default: block. That way the compiler can warn
> if
> some case doesn't set it in all branches.
>

Done.

> > static BIO_METHOD *
> > my_BIO_s_socket(void)
> > {
>
> Wonder if we should rename this. It's pretty odd that we still call it's
> not
> really related to s_socket anymore, and doesn't even really implement the
> same
> interface (e.g. get_fd doesn't work anymore). Similarly, my_SSL_set_fd()
> doesn't actually call set_fd() anymore, which sure seems odd.
>

Done. I wasn't sure what to name them, or even what the naming convention
was, but I opted to name them after the Port and PGconn objects they wrap.
Happy to switch to another name if you have a better idea though.

David

Attachment Content-Type Size
avoid-mixing-custom-and-openssl-bios-v2.patch application/x-patch 17.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Erik Wienhold 2024-02-21 22:19:26 Re: Patch: Add parse_type Function
Previous Message David E. Wheeler 2024-02-21 21:49:35 Re: Patch: Add parse_type Function