Re: Support for NSS as a libpq TLS backend

From: Daniel Gustafsson <daniel(at)yesql(dot)se>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Julien Rouhaud <rjuju123(at)gmail(dot)com>, Jacob Champion <pchampion(at)vmware(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "hlinnaka(at)iki(dot)fi" <hlinnaka(at)iki(dot)fi>, "andrew(dot)dunstan(at)2ndquadrant(dot)com" <andrew(dot)dunstan(at)2ndquadrant(dot)com>, "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>, "thomas(dot)munro(at)gmail(dot)com" <thomas(dot)munro(at)gmail(dot)com>, "sfrost(at)snowman(dot)net" <sfrost(at)snowman(dot)net>
Subject: Re: Support for NSS as a libpq TLS backend
Date: 2022-01-26 20:39:16
Message-ID: 4AAEE9FC-A76E-4E6E-874B-2136C487753E@yesql.se
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On 23 Jan 2022, at 22:20, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2022-01-18 13:42:54 +0100, Daniel Gustafsson wrote:

Thanks heaps for the review, much appreciated!

>> + install_script: |
>> + DEBIAN_FRONTEND=noninteractive apt-get --yes install libnss3 libnss3-dev libnss3-tools libnspr4 libnspr4-dev
>
> This needs an apt-get update beforehand to succeed. That's what caused the last few runs
> to fail, see e.g.
> https://cirrus-ci.com/task/6293612580306944

Ah, good point. Adding that made it indeed work.

> Just duplicating the task doesn't really scale once in tree.

Totally agree. This was mostly a hack to see if I could make the CFBot build a
tailored build, then life threw school closures etc at me and I sort of forgot
about removing it again.

> What about
> reconfiguring (note: add --enable-depend) the linux tasks to build against
> nss, and then run the relevant subset of tests with it? Most tests don't use
> tcp / SSL anyway, so rerunning a small subset of tests should be feasible?

That's an interesting idea, I think that could work and be reasonably readable
at the same time (and won't require in-depth knowledge of Cirrus). As it's the
same task it does spend more time towards the max runtime per task, but that's
not a problem for now. It's worth keeping in mind though if we deem this to be
a way forward with testing multiple settings.

>> From 297ee9ab31aa579e002edc335cce83dae19711b1 Mon Sep 17 00:00:00 2001
>> From: Daniel Gustafsson <daniel(at)yesql(dot)se>
>> Date: Mon, 8 Feb 2021 23:52:22 +0100
>> Subject: [PATCH v52 01/11] nss: Support libnss as TLS library in libpq
>
>> 16 files changed, 3192 insertions(+), 7 deletions(-)
>
> Phew. This is a huge patch.

Yeah =/ .. without going beyond and inventing new things on top what is needed
to replace OpenSSL, a lot of code (and tests) has to be written. If nothing
else, this work at least highlights just how much we've come to use OpenSSL.

> Damn, I only opened this thread to report the CI failure. But now I ended up
> doing a small review...

Thanks! Next time we meet, I owe you a beverage of choice.

>> +#include "common/nss.h"
>> +
>> +/*
>> + * The nspr/obsolete/protypes.h NSPR header typedefs uint64 and int64 with
>> + * colliding definitions from ours, causing a much expected compiler error.
>> + * Remove backwards compatibility with ancient NSPR versions to avoid this.
>> + */
>> +#define NO_NSPR_10_SUPPORT
>> +#include <nspr.h>
>> +#include <prerror.h>
>> +#include <prio.h>
>> +#include <prmem.h>
>> +#include <prtypes.h>
>
> Duplicated with nss.h. Which brings me to:

Fixed, there and elsewhere.

>> +#include <nss.h>
>
> Is it a great idea to have common/nss.h when there's a library header nss.h?
> Perhaps we should have a pg_ssl_{nss,openssl}.h or such?

That's a good point, I modelled it after common/openssl.h but I agree it's
better to differentiate the filenames. I've renamed it to common/pg_nss.h and
we should IMO rename common/openssl.h regardless of what happens to this patch.

>> +/* ------------------------------------------------------------ */
>> +/* Public interface */
>> +/* ------------------------------------------------------------ */
>
> Nitpicks:
> I don't think we typically do multiple /* */ comments in a row for this type
> of thing. I also don't particularly like centering things like this, tends to
> get inconsistent across comments.

This is just a copy/paste from be-secure-openssl.c, but I'm far from married to
it so happy to remove. Fixed.

>> +/*
>> + * be_tls_open_server
>> + *
>> + * Since NSPR initialization must happen after forking, most of the actual
>> + * setup of NSPR/NSS is done here rather than in be_tls_init.
>
> The "Since ... must happen after forking" sounds like it's referencing a
> previously remarked upon fact. But I don't see anything but a copy of this
> comment.

NSS contexts aren't fork safe, IIRC it's around its use of file descriptors.
Fairly old NSS documentation and mailing list posts cite hardware tokens (which
was a very strong focus in the earlier days of NSS) not being safe to use across
forks and thus none of NSS was ever intended to be initialized until after the
fork. I've reworded this comment a bit to make that clearer.

> Does this make some things notably more expensive? Presumably it does remove a
> bunch of COW opportunities, but likely that's not a huge factor compared to
> assymetric crypto negotiation...

Right, the context of setting up crypto across a network connection it's highly
likely to drown out the costs.

> Maybe soem of this commentary should migrate to the file header or such?

Maybe, or perhaps README.ssl? Not sure where it would be most reasonable to
keep it such that it's also kept up to date.

>> This introduce
>> + * differences with the OpenSSL support where some errors are only reported
>> + * at runtime with NSS where they are reported at startup with OpenSSL.
>
> Found this sentence hard to parse somehow.
>
> It seems pretty unfriendly to only have minimal error checking at postmaster
> startup time. Seems at least the presence and usability of keys should be done
> *also* at that time?

I'll look at adding some setup, and subsequent teardown, of NSS at startup
during which we could do checking to be more on par with how the OpenSSL
backend will report errors.

>> + /*
>> + * If no ciphers are specified, enable them all.
>> + */
>> + if (!SSLCipherSuites || strlen(SSLCipherSuites) == 0)
>> + {
>> + ...
>> + if (status != SECSuccess)
>> + {
>> + ereport(COMMERROR,
>> + (errmsg("invalid cipher-suite specified: %s", c)));
>> + return -1;
>
> It likely doesn't matter much because the backend will exit, but because
> COMERROR doesn't throw, it seems like this will leak "ciphers"?

Agreed, it won't matter much in practice but we should clearly pfree it, fixed.

>> + pfree(ciphers);
>> +
>> + if (!found)
>> + {
>> + ereport(COMMERROR,
>> + (errmsg("no cipher-suites found")));
>> + return -1;
>> + }
>> + }
>
> Seems like this could reasonably done in a separate function?

Agreed, trimming the length of an already very long function is a good idea.
Fixed.

> I assume PR_GetError() is some thread-local construct, given it's also used in
> libpq?

Correct.

> Why, oh why, do people copy the abysmal "global errno" approach everywhere.

Even better, NSPR has two of them: PR_GetError and PR_GetOSError (the latter
isn't used in this implementation, but it could potentially be added to error
paths on NSS_InitContext and other calls that read off the filesystem).

>> +ssize_t
>> +be_tls_read(Port *port, void *ptr, size_t len, int *waitfor)
>> +{
>
> I'm not a fan of duplicating the symbol names between be-secure-openssl.c and
> this. For one it's annoying for source code naviation. It also seems that at
> some point we might want to be able to link against both at the same time?
> Maybe we should name them unambiguously and then use some indirection in a
> header somewhere?

We could do that, and that's something that we could do independently of this
patch to keep the scope down. Doing it in master now with just the OpenSSL
implementation as a consumer would be a logical next step in the TLS library
abstraction we've done.

>> + PR_Close(port->pr_fd);
>> + port->pr_fd = NULL;
>
> What if we failed before initializing pr_fd?

Fixed.

>> + /*
>> + * Since there is no password callback in NSS when the server starts up,
>> + * it makes little sense to create an interactive callback. Thus, if this
>> + * is a retry attempt then give up immediately.
>> + */
>> + if (retry)
>> + return NULL;
>
> That's really not great. Can't we do something like initialize NSS in
> postmaster, load the key into memory, including prompting, and then shut nss
> down again?

I can look at doing something along those lines. It does require setting up a
fair bit of infrastructure but if the code refactored to allow reuse it can
probably be done fairly readable.

>> +/*
>> + * raw_subject_common_name
>> + *
>> + * Returns the Subject Common Name for the given certificate as a raw char
>> + * buffer (that is, without any form of escaping for unprintable characters or
>> + * embedded nulls), with the length of the buffer returned in the len param.
>> + * The buffer is allocated in the TopMemoryContext and is given a NULL
>> + * terminator so that callers are safe to call strlen() on it.
>> + *
>> + * This is used instead of CERT_GetCommonName(), which always performs quoting
>> + * and/or escaping. NSS doesn't appear to give us a way to easily unescape the
>> + * result, and we need to store the raw CN into port->peer_cn for compatibility
>> + * with the OpenSSL implementation.
>> + */
>
> Do we have a testcase for embedded NULLs in common names?

We don't, neither for OpenSSL or NSS. AFAICR Jacob spent days trying to get a
certificate generation to include an embedded NULL byte but in the end gave up.
We would have to write our own tools for generating certificates to add that
(which may or may not be a bad idea, but it hasn't been done).

>> + /* Found a CN, decode and copy it into a newly allocated buffer */
>> + buf = CERT_DecodeAVAValue(&(*ava)->value);
>> + if (!buf)
>> + {
>> + /*
>> + * This failure case is difficult to test. (Since this code
>> + * runs after certificate authentication has otherwise
>> + * succeeded, you'd need to convince a CA implementation to
>> + * sign a corrupted certificate in order to get here.)
>
> Why is that hard with a toy CA locally? Might not be worth the effort, but if
> the comment explicitly talks about it being hard...

The gist of this comment is that it's hard to do with a stock local CA. I've
added a small blurb to clarify that a custom implementation would be required.

>> +/*
>> + * pg_SSLShutdownFunc
>> + * Callback for NSS shutdown
>> + *
>> + * If NSS is terminated from the outside when the connection is still in use
>
> What does "NSS is terminated from the outside when the connection" really
> mean? Does this mean the client initiating something?

If an extension, or other server-loaded code, interfered with NSS and managed
to close contexts in order to interfere with connections this would ensure us
closing it down cleanly.

That being said, I was now unable to get my old testcase working so I've for
now removed this callback from the patch until I can work out if we can make
proper use of it. AFAICS other mature NSS implementations aren't using it
(OpenLDAP did in the past but have since removed it, will look at how/why).

>> + else
>> + {
>> + SECMOD_DestroyModule(ca_trust);
>> + ca_trust = NULL;
>> + }
>> + }
>
> Might just misunderstand: How can it be ok to destroy ca_trust here? What if
> there's other connections using it? The same thread might be using multiple
> connections, and multiple threads might be using connections. Seems very much
> not thread safe.

Right, that's a leftover from early hacking that I had missed. Fixed.

>> + /* This locking is modelled after fe-secure-openssl.c */
>> + if (ssl_config_mutex == NULL)
>> + {
>> + ...
>
> I'd very much like to avoid duplicating this code. Can we put it somewhere
> combined instead?

I can look at splitting it out to fe-secure-common.c. A first step here to
keep the goalposts from moving in this patch would be to look at combining lock
init in fe-secure-openssl.c:pgtls_init() and fe-connect.c:default_threadlock,
and then just apply the same recipe here once landed. This could be done
independent of this patch.

>> + /*
>> + * The NSPR documentation states that runtime initialization via PR_Init
>> + * is no longer required, as the first caller into NSPR will perform the
>> + * initialization implicitly. See be-secure-nss.c for further discussion
>> + * on PR_Init.
>> + */
>> + PR_Init(PR_USER_THREAD, PR_PRIORITY_NORMAL, 0);
>
> Why does this, and several subsequent bits, have to happen under a lock?

NSS initialization isn't thread-safe, there is more discussion upthread in and
around this email:

https://postgr.es/m/c8d4bc0dfd266799ab4213f1673a813786ac0c70.camel@vmware.com

> Why are some parts returning -1 and some PGRES_POLLING_FAILED? -1 certainly
> isn't a member of PostgresPollingStatusType.

That was a thinko, fixed.

>> + /*
>> + * The error cases for PR_Recv are not documented, but can be
>> + * reverse engineered from _MD_unix_map_default_error() in the
>> + * NSPR code, defined in pr/src/md/unix/unix_errors.c.
>> + */
>
> Can we propose a patch to document them? Don't want to get bitten by this
> suddenly changing...

I can certainly propose something on their mailinglist, but I unfortunately
wouldn't get my hopes up too high as NSS and documentation aren't exactly best
friends (the in-tree docs doesn't cover the API and Mozilla recently removed
most of the online docs in their neverending developer site reorg).

>> From a12769bd793a8e073125c3b3a176b355335646bc Mon Sep 17 00:00:00 2001
>> From: Daniel Gustafsson <daniel(at)yesql(dot)se>
>> Date: Mon, 8 Feb 2021 23:52:45 +0100
>> Subject: [PATCH v52 07/11] nss: Support NSS in pgcrypto
>>
>> This extends pgcrypto to be able to use libnss as a cryptographic
>> backend for pgcrypto much like how OpenSSL is a supported backend.
>> Blowfish is not a supported cipher in NSS, so the implementation
>> falls back on the built-in BF code to be compatible in terms of
>> cipher support.
>
> I wish we didn't have pgcrypto in its current form.

Yes. Very much yes. I don't think doing anything about that in the context of
this patch is wise, but a discussion on where to take pgcrypto in the future
would probably be a good idea.

>> From 5079ce8a677074b93ef1f118d535c6dee4ce64f9 Mon Sep 17 00:00:00 2001
>> From: Daniel Gustafsson <daniel(at)yesql(dot)se>
>> Date: Mon, 8 Feb 2021 23:52:55 +0100
>> Subject: [PATCH v52 10/11] nss: Build infrastructure
>>
>> Finally this adds the infrastructure to build a postgres installation
>> with libnss support.
>
> I would suggest trying to come up with a way to reorder / split the series so
> that smaller pieces are committable. The way you have this right now leaves
> you with applying all of it at once as the only realistic way. And this
> patchset is too large for that.

I completely agree, the hard part is identifying smaller sets which also make
sense and which doesn't leave the tree in a bad state should anyone check out
that specific point in time.

The two commits in the patchset that are "easy" to consider for pushing
independently in this regard are IMO:

* 0002 Test refactoring to support multiple TLS libraries.
* 0004 Check for empty stderr during connect_ok

The refactoring in 0002 is hopefully not too controversial, but it clearly
needs eyes from someone more familiar with modern and idiomatic Perl. 0004
could IMO be pushed regardless of the fate of this patchset (after being
floated in its own thread on -hackers).

In order to find a good split I think we need to figure what to optimize for;
do we optimize for ease of reverting should that be needed, or along
functionality borders, or something else? I don't have good ideas here, but a
single 7596 insertions(+), 421 deletions(-) commit is clearly not a good idea.

Stephen had an idea off-list that we could look at splitting this across the
server/client boundary, which I think is the only idea I've so far which has
legs. (The first to go in would come with the common code of course.)

Do you have any thoughts after reading through the patch?

The attached v53 incorporates the fixes discussed above, and builds green for
both OpenSSL and NSS in Cirrus on my Github repo (thanks again for your work on
those files) so it will be interesting to see the CFBot running them. Next
would be to figure out how to make the MSVC build it, basing an attempt on
Andrew's blogpost.

--
Daniel Gustafsson https://vmware.com/

Attachment Content-Type Size
v53-0011-NSS-experimental-support-for-NSS-in-CI.patch application/octet-stream 3.2 KB
v53-0010-nss-Build-infrastructure.patch application/octet-stream 25.1 KB
v53-0009-nss-Support-NSS-in-cryptohash.patch application/octet-stream 7.2 KB
v53-0008-nss-Support-NSS-in-sslinfo.patch application/octet-stream 9.5 KB
v53-0007-nss-Support-NSS-in-pgcrypto.patch application/octet-stream 78.2 KB
v53-0006-nss-Documentation.patch application/octet-stream 35.6 KB
v53-0005-nss-pg_strong_random-support.patch application/octet-stream 2.0 KB
v53-0004-test-check-for-empty-stderr-during-connect_ok.patch application/octet-stream 3.0 KB
v53-0003-nss-Add-NSS-specific-tests.patch application/octet-stream 64.2 KB
v53-0002-Refactor-SSL-testharness-for-multiple-library.patch application/octet-stream 13.2 KB
v53-0001-nss-Support-libnss-as-TLS-library-in-libpq.patch application/octet-stream 101.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2022-01-26 20:45:18 Re: refactoring basebackup.c
Previous Message Robert Haas 2022-01-26 20:24:25 Re: Refactoring of compression options in pg_basebackup