Re: Support for NSS as a libpq TLS backend

From: Andres Freund <andres(at)anarazel(dot)de>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Julien Rouhaud <rjuju123(at)gmail(dot)com>, Jacob Champion <pchampion(at)vmware(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "hlinnaka(at)iki(dot)fi" <hlinnaka(at)iki(dot)fi>, "andrew(dot)dunstan(at)2ndquadrant(dot)com" <andrew(dot)dunstan(at)2ndquadrant(dot)com>, "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>, "thomas(dot)munro(at)gmail(dot)com" <thomas(dot)munro(at)gmail(dot)com>, "sfrost(at)snowman(dot)net" <sfrost(at)snowman(dot)net>
Subject: Re: Support for NSS as a libpq TLS backend
Date: 2022-01-23 21:20:05
Message-ID: 20220123212005.7ocyxqimbdutlfkc@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-01-18 13:42:54 +0100, Daniel Gustafsson wrote:
> Fixed, I had made a mistake in the OpenSSL.pm testcode and failed to catch it
> in testing.

> +task:
> + name: Linux - Debian Bullseye (nss)
> [ copy of a bunch of code ]

I also needed similar-but-not-quite-equivalent tasks for the meson patch as
well. I just moved to having a splitting the tasks into a template and a use
of it. It's probably not quite right as I did there, but it might be worth
looking into:

https://github.com/anarazel/postgres/blob/meson/.cirrus.yml#L181

But maybe this case actually has a better solution, see two paragraphs down:

> + install_script: |
> + DEBIAN_FRONTEND=noninteractive apt-get --yes install libnss3 libnss3-dev libnss3-tools libnspr4 libnspr4-dev

This needs an apt-get update beforehand to succeed. That's what caused the last few runs
to fail, see e.g.
https://cirrus-ci.com/task/6293612580306944

Just duplicating the task doesn't really scale once in tree. What about
reconfiguring (note: add --enable-depend) the linux tasks to build against
nss, and then run the relevant subset of tests with it? Most tests don't use
tcp / SSL anyway, so rerunning a small subset of tests should be feasible?

> From 297ee9ab31aa579e002edc335cce83dae19711b1 Mon Sep 17 00:00:00 2001
> From: Daniel Gustafsson <daniel(at)yesql(dot)se>
> Date: Mon, 8 Feb 2021 23:52:22 +0100
> Subject: [PATCH v52 01/11] nss: Support libnss as TLS library in libpq

> 16 files changed, 3192 insertions(+), 7 deletions(-)

Phew. This is a huge patch.

Damn, I only opened this thread to report the CI failure. But now I ended up
doing a small review...

> +#include "common/nss.h"
> +
> +/*
> + * The nspr/obsolete/protypes.h NSPR header typedefs uint64 and int64 with
> + * colliding definitions from ours, causing a much expected compiler error.
> + * Remove backwards compatibility with ancient NSPR versions to avoid this.
> + */
> +#define NO_NSPR_10_SUPPORT
> +#include <nspr.h>
> +#include <prerror.h>
> +#include <prio.h>
> +#include <prmem.h>
> +#include <prtypes.h>

Duplicated with nss.h. Which brings me to:

> +#include <nss.h>

Is it a great idea to have common/nss.h when there's a library header nss.h?
Perhaps we should have a pg_ssl_{nss,openssl}.h or such?

> +/* ------------------------------------------------------------ */
> +/* Public interface */
> +/* ------------------------------------------------------------ */

Nitpicks:
I don't think we typically do multiple /* */ comments in a row for this type
of thing. I also don't particularly like centering things like this, tends to
get inconsistent across comments.

> +/*
> + * be_tls_open_server
> + *
> + * Since NSPR initialization must happen after forking, most of the actual
> + * setup of NSPR/NSS is done here rather than in be_tls_init.

The "Since ... must happen after forking" sounds like it's referencing a
previously remarked upon fact. But I don't see anything but a copy of this
comment.

Does this make some things notably more expensive? Presumably it does remove a
bunch of COW opportunities, but likely that's not a huge factor compared to
assymetric crypto negotiation...

Maybe soem of this commentary should migrate to the file header or such?

> This introduce
> + * differences with the OpenSSL support where some errors are only reported
> + * at runtime with NSS where they are reported at startup with OpenSSL.

Found this sentence hard to parse somehow.

It seems pretty unfriendly to only have minimal error checking at postmaster
startup time. Seems at least the presence and usability of keys should be done
*also* at that time?

> + /*
> + * If no ciphers are specified, enable them all.
> + */
> + if (!SSLCipherSuites || strlen(SSLCipherSuites) == 0)
> + {
> + status = NSS_SetDomesticPolicy();
> + if (status != SECSuccess)
> + {
> + ereport(COMMERROR,
> + (errmsg("unable to set cipher policy: %s",
> + pg_SSLerrmessage(PR_GetError()))));
> + return -1;
> + }
> + }
> + else
> + {
> + char *ciphers,
> + *c;
> +
> + char *sep = ":;, ";
> + PRUint16 ciphercode;
> + const PRUint16 *nss_ciphers;
> + bool found = false;
> +
> + /*
> + * If the user has specified a set of preferred cipher suites we start
> + * by turning off all the existing suites to avoid the risk of down-
> + * grades to a weaker cipher than expected.
> + */
> + nss_ciphers = SSL_GetImplementedCiphers();
> + for (int i = 0; i < SSL_GetNumImplementedCiphers(); i++)
> + SSL_CipherPrefSet(model, nss_ciphers[i], PR_FALSE);
> +
> + ciphers = pstrdup(SSLCipherSuites);
> +
> + for (c = strtok(ciphers, sep); c; c = strtok(NULL, sep))
> + {
> + if (pg_find_cipher(c, &ciphercode))
> + {
> + status = SSL_CipherPrefSet(model, ciphercode, PR_TRUE);
> + found = true;
> + if (status != SECSuccess)
> + {
> + ereport(COMMERROR,
> + (errmsg("invalid cipher-suite specified: %s", c)));
> + return -1;

It likely doesn't matter much because the backend will exit, but because
COMERROR doesn't throw, it seems like this will leak "ciphers"?

> + }
> + }
> + }
> +
> + pfree(ciphers);
> +
> + if (!found)
> + {
> + ereport(COMMERROR,
> + (errmsg("no cipher-suites found")));
> + return -1;
> + }
> + }

Seems like this could reasonably done in a separate function?

> + server_cert = PK11_FindCertFromNickname(ssl_cert_file, (void *) port);
> + if (!server_cert)
> + {
> + if (dummy_ssl_passwd_cb_called)
> + {
> + ereport(COMMERROR,
> + (errmsg("unable to load certificate for \"%s\": %s",
> + ssl_cert_file, pg_SSLerrmessage(PR_GetError())),
> + errhint("The certificate requires a password.")));
> + return -1;
> + }

I assume PR_GetError() is some thread-local construct, given it's also used in
libpq? Why, oh why, do people copy the abysmal "global errno" approach
everywhere.

> +ssize_t
> +be_tls_read(Port *port, void *ptr, size_t len, int *waitfor)
> +{

I'm not a fan of duplicating the symbol names between be-secure-openssl.c and
this. For one it's annoying for source code naviation. It also seems that at
some point we might want to be able to link against both at the same time?
Maybe we should name them unambiguously and then use some indirection in a
header somewhere?

> + ssize_t n_read;
> + PRErrorCode err;
> +
> + n_read = PR_Read(port->pr_fd, ptr, len);
> +
> + if (n_read < 0)
> + {
> + err = PR_GetError();
> +
> + if (err == PR_WOULD_BLOCK_ERROR)
> + {
> + *waitfor = WL_SOCKET_READABLE;
> + errno = EWOULDBLOCK;
> + }
> + else
> + errno = ECONNRESET;
> + }
> +
> + return n_read;
> +}
> +
> +ssize_t
> +be_tls_write(Port *port, void *ptr, size_t len, int *waitfor)
> +{
> + ssize_t n_write;
> + PRErrorCode err;
> + PRIntn flags = 0;
> +
> + /*
> + * The flags parameter to PR_Send is no longer used and is, according to
> + * the documentation, required to be zero.
> + */
> + n_write = PR_Send(port->pr_fd, ptr, len, flags, PR_INTERVAL_NO_WAIT);
> +
> + if (n_write < 0)
> + {
> + err = PR_GetError();
> +
> + if (err == PR_WOULD_BLOCK_ERROR)
> + {
> + *waitfor = WL_SOCKET_WRITEABLE;
> + errno = EWOULDBLOCK;
> + }
> + else
> + errno = ECONNRESET;
> + }
> +
> + return n_write;
> +}
> +
> +/*
> + * be_tls_close
> + *
> + * Callback for closing down the current connection, if any.
> + */
> +void
> +be_tls_close(Port *port)
> +{
> + if (!port)
> + return;
> + /*
> + * Immediately signal to the rest of the backend that this connnection is
> + * no longer to be considered to be using TLS encryption.
> + */
> + port->ssl_in_use = false;
> +
> + if (port->peer_cn)
> + {
> + SSL_InvalidateSession(port->pr_fd);
> + pfree(port->peer_cn);
> + port->peer_cn = NULL;
> + }
> +
> + PR_Close(port->pr_fd);
> + port->pr_fd = NULL;

What if we failed before initializing pr_fd?

> + /*
> + * Since there is no password callback in NSS when the server starts up,
> + * it makes little sense to create an interactive callback. Thus, if this
> + * is a retry attempt then give up immediately.
> + */
> + if (retry)
> + return NULL;

That's really not great. Can't we do something like initialize NSS in
postmaster, load the key into memory, including prompting, and then shut nss
down again?

> +/*
> + * raw_subject_common_name
> + *
> + * Returns the Subject Common Name for the given certificate as a raw char
> + * buffer (that is, without any form of escaping for unprintable characters or
> + * embedded nulls), with the length of the buffer returned in the len param.
> + * The buffer is allocated in the TopMemoryContext and is given a NULL
> + * terminator so that callers are safe to call strlen() on it.
> + *
> + * This is used instead of CERT_GetCommonName(), which always performs quoting
> + * and/or escaping. NSS doesn't appear to give us a way to easily unescape the
> + * result, and we need to store the raw CN into port->peer_cn for compatibility
> + * with the OpenSSL implementation.
> + */

Do we have a testcase for embedded NULLs in common names?

> +static char *
> +raw_subject_common_name(CERTCertificate *cert, unsigned int *len)
> +{
> + CERTName subject = cert->subject;
> + CERTRDN **rdn;
> +
> + for (rdn = subject.rdns; *rdn; rdn++)
> + {
> + CERTAVA **ava;
> +
> + for (ava = (*rdn)->avas; *ava; ava++)
> + {
> + SECItem *buf;
> + char *cn;
> +
> + if (CERT_GetAVATag(*ava) != SEC_OID_AVA_COMMON_NAME)
> + continue;
> +
> + /* Found a CN, decode and copy it into a newly allocated buffer */
> + buf = CERT_DecodeAVAValue(&(*ava)->value);
> + if (!buf)
> + {
> + /*
> + * This failure case is difficult to test. (Since this code
> + * runs after certificate authentication has otherwise
> + * succeeded, you'd need to convince a CA implementation to
> + * sign a corrupted certificate in order to get here.)

Why is that hard with a toy CA locally? Might not be worth the effort, but if
the comment explicitly talks about it being hard...

> + * Follow the behavior of CERT_GetCommonName() in this case and
> + * simply return NULL, as if a Common Name had not been found.
> + */
> + goto fail;
> + }
> +
> + cn = MemoryContextAlloc(TopMemoryContext, buf->len + 1);
> + memcpy(cn, buf->data, buf->len);
> + cn[buf->len] = '\0';
> +
> + *len = buf->len;
> +
> + SECITEM_FreeItem(buf, PR_TRUE);
> + return cn;
> + }
> + }
> +
> +fail:
> + /* Not found */
> + *len = 0;
> + return NULL;
> +}
>

> +/*
> + * pg_SSLShutdownFunc
> + * Callback for NSS shutdown
> + *
> + * If NSS is terminated from the outside when the connection is still in use

What does "NSS is terminated from the outside when the connection" really
mean? Does this mean the client initiating something?

> + * we must treat this as potentially hostile and immediately close to avoid
> + * leaking the connection in any way. Once this is called, NSS will shutdown
> + * regardless so we may as well clean up the best we can. Returning SECFailure
> + * will cause the NSS shutdown to return with an error, but it will shutdown
> + * nevertheless. nss_data is reserved for future use and is always NULL.
> + */
> +static SECStatus
> +pg_SSLShutdownFunc(void *private_data, void *nss_data)
> +{
> + Port *port = (Port *) private_data;
> +
> + if (!port || !port->ssl_in_use)
> + return SECSuccess;

How can that happen?

> + /*
> + * There is a connection still open, close it and signal to whatever that
> + * called the shutdown that it was erroneous.
> + */
> + be_tls_close(port);
> + be_tls_destroy();

And this doesn't have any dangerous around those functions getting called
again later?

> +void
> +pgtls_close(PGconn *conn)
> +{
> + conn->ssl_in_use = false;
> + conn->has_password = false;
> +
> + /*
> + * If the system trust module has been loaded we must try to unload it
> + * before closing the context, since it will otherwise fail reporting a
> + * SEC_ERROR_BUSY error.
> + */
> + if (ca_trust != NULL)
> + {
> + if (SECMOD_UnloadUserModule(ca_trust) != SECSuccess)
> + {
> + pqInternalNotice(&conn->noticeHooks,
> + "unable to unload trust module");
> + }
> + else
> + {
> + SECMOD_DestroyModule(ca_trust);
> + ca_trust = NULL;
> + }
> + }

Might just misunderstand: How can it be ok to destroy ca_trust here? What if
there's other connections using it? The same thread might be using multiple
connections, and multiple threads might be using connections. Seems very much
not thread safe.

> +PostgresPollingStatusType
> +pgtls_open_client(PGconn *conn)
> +{
> + SECStatus status;
> + PRFileDesc *model;
> + NSSInitParameters params;
> + SSLVersionRange desired_range;
> +
> +#ifdef ENABLE_THREAD_SAFETY
> +#ifdef WIN32
> + /* This locking is modelled after fe-secure-openssl.c */
> + if (ssl_config_mutex == NULL)
> + {
> + while (InterlockedExchange(&win32_ssl_create_mutex, 1) == 1)
> + /* loop while another thread owns the lock */ ;
> + if (ssl_config_mutex == NULL)
> + {
> + if (pthread_mutex_init(&ssl_config_mutex, NULL))
> + {
> + printfPQExpBuffer(&conn->errorMessage,
> + libpq_gettext("unable to lock thread"));
> + return PGRES_POLLING_FAILED;
> + }
> + }
> + InterlockedExchange(&win32_ssl_create_mutex, 0);
> + }
> +#endif
> + if (pthread_mutex_lock(&ssl_config_mutex))
> + {
> + printfPQExpBuffer(&conn->errorMessage,
> + libpq_gettext("unable to lock thread"));
> + return PGRES_POLLING_FAILED;
> + }
> +#endif /* ENABLE_THREAD_SAFETY */

I'd very much like to avoid duplicating this code. Can we put it somewhere
combined instead?

> + /*
> + * The NSPR documentation states that runtime initialization via PR_Init
> + * is no longer required, as the first caller into NSPR will perform the
> + * initialization implicitly. See be-secure-nss.c for further discussion
> + * on PR_Init.
> + */
> + PR_Init(PR_USER_THREAD, PR_PRIORITY_NORMAL, 0);

Why does this, and several subsequent bits, have to happen under a lock?

> + if (conn->ssl_max_protocol_version && strlen(conn->ssl_max_protocol_version) > 0)
> + {
> + int ssl_max_ver = ssl_protocol_param_to_nss(conn->ssl_max_protocol_version);
> +
> + if (ssl_max_ver == -1)
> + {
> + printfPQExpBuffer(&conn->errorMessage,
> + libpq_gettext("invalid value \"%s\" for maximum version of SSL protocol\n"),
> + conn->ssl_max_protocol_version);
> + return -1;
> + }
> +
> + desired_range.max = ssl_max_ver;
> + }
> +
> + if (SSL_VersionRangeSet(model, &desired_range) != SECSuccess)
> + {
> + printfPQExpBuffer(&conn->errorMessage,
> + libpq_gettext("unable to set allowed SSL protocol version range: %s"),
> + pg_SSLerrmessage(PR_GetError()));
> + return PGRES_POLLING_FAILED;
> + }

Why are some parts returning -1 and some PGRES_POLLING_FAILED? -1 certainly
isn't a member of PostgresPollingStatusType.

> + /*
> + * The error cases for PR_Recv are not documented, but can be
> + * reverse engineered from _MD_unix_map_default_error() in the
> + * NSPR code, defined in pr/src/md/unix/unix_errors.c.
> + */

Can we propose a patch to document them? Don't want to get bitten by this
suddenly changing...

> From a12769bd793a8e073125c3b3a176b355335646bc Mon Sep 17 00:00:00 2001
> From: Daniel Gustafsson <daniel(at)yesql(dot)se>
> Date: Mon, 8 Feb 2021 23:52:45 +0100
> Subject: [PATCH v52 07/11] nss: Support NSS in pgcrypto
>
> This extends pgcrypto to be able to use libnss as a cryptographic
> backend for pgcrypto much like how OpenSSL is a supported backend.
> Blowfish is not a supported cipher in NSS, so the implementation
> falls back on the built-in BF code to be compatible in terms of
> cipher support.

I wish we didn't have pgcrypto in its current form.

> From 5079ce8a677074b93ef1f118d535c6dee4ce64f9 Mon Sep 17 00:00:00 2001
> From: Daniel Gustafsson <daniel(at)yesql(dot)se>
> Date: Mon, 8 Feb 2021 23:52:55 +0100
> Subject: [PATCH v52 10/11] nss: Build infrastructure
>
> Finally this adds the infrastructure to build a postgres installation
> with libnss support.

I would suggest trying to come up with a way to reorder / split the series so
that smaller pieces are committable. The way you have this right now leaves
you with applying all of it at once as the only realistic way. And this
patchset is too large for that.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2022-01-23 21:31:34 Re: fairywren is generating bogus BASE_BACKUP commands
Previous Message Tom Lane 2022-01-23 21:15:17 Re: fairywren is generating bogus BASE_BACKUP commands