Re: Support for NSS as a libpq TLS backend

From: Andres Freund <andres(at)anarazel(dot)de>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Support for NSS as a libpq TLS backend
Date: 2020-10-20 19:15:29
Message-ID: 20201020191529.5yverw3ybaube3pg@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2020-10-20 14:24:24 +0200, Daniel Gustafsson wrote:
> From 0cb0e6a0ce9adb18bc9d212bd03e4e09fa452972 Mon Sep 17 00:00:00 2001
> From: Daniel Gustafsson <daniel(at)yesql(dot)se>
> Date: Thu, 8 Oct 2020 18:44:28 +0200
> Subject: [PATCH] Support for NSS as a TLS backend v12
> ---
> configure | 223 +++-
> configure.ac | 39 +-
> contrib/Makefile | 2 +-
> contrib/pgcrypto/Makefile | 5 +
> contrib/pgcrypto/nss.c | 773 +++++++++++
> contrib/pgcrypto/openssl.c | 2 +-
> contrib/pgcrypto/px.c | 1 +
> contrib/pgcrypto/px.h | 1 +

Personally I'd like to see this patch broken up a bit - it's quite
large. Several of the changes could easily be committed separately, no?

> if test "$with_openssl" = yes ; then
> + if test x"$with_nss" = x"yes" ; then
> + AC_MSG_ERROR([multiple SSL backends cannot be enabled simultaneously"])
> + fi

Based on a quick look there's no similar error check for the msvc
build. Should there be?

>
> +if test "$with_nss" = yes ; then
> + if test x"$with_openssl" = x"yes" ; then
> + AC_MSG_ERROR([multiple SSL backends cannot be enabled simultaneously"])
> + fi

Isn't this a repetition of the earlier check?

> + CLEANLDFLAGS="$LDFLAGS"
> + # TODO: document this set of LDFLAGS
> + LDFLAGS="-lssl3 -lsmime3 -lnss3 -lplds4 -lplc4 -lnspr4 $LDFLAGS"

Shouldn't this use nss-config or such?

> +if test "$with_nss" = yes ; then
> + AC_CHECK_HEADER(ssl.h, [], [AC_MSG_ERROR([header file <ssl.h> is required for NSS])])
> + AC_CHECK_HEADER(nss.h, [], [AC_MSG_ERROR([header file <nss.h> is required for NSS])])
> +fi

Hm. For me, on debian, these headers are not directly in the default
include search path, but would be as nss/ssl.h. I don't see you adding
nss/ to CFLAGS anywhere? How does this work currently?

I think it'd also be better if we could include these files as nss/ssl.h
etc - ssl.h is a name way too likely to conflict imo.

> +++ b/src/backend/libpq/be-secure-nss.c
> @@ -0,0 +1,1158 @@
> +/*
> + * BITS_PER_BYTE is also defined in the NSPR header files, so we need to undef
> + * our version to avoid compiler warnings on redefinition.
> + */
> +#define pg_BITS_PER_BYTE BITS_PER_BYTE
> +#undef BITS_PER_BYTE

Most compilers/preprocessors don't warn about redefinitions when they
would result in the same value (IIRC we have some cases of redefinitions
in tree even). Does nspr's differ?

> +/*
> + * The nspr/obsolete/protypes.h NSPR header typedefs uint64 and int64 with
> + * colliding definitions from ours, causing a much expected compiler error.
> + * The definitions are however not actually used in NSPR at all, and are only
> + * intended for what seems to be backwards compatibility for apps written
> + * against old versions of NSPR. The following comment is in the referenced
> + * file, and was added in 1998:
> + *
> + * This section typedefs the old 'native' types to the new PR<type>s.
> + * These definitions are scheduled to be eliminated at the earliest
> + * possible time. The NSPR API is implemented and documented using
> + * the new definitions.
> + *
> + * As there is no opt-out from pulling in these typedefs, we define the guard
> + * for the file to exclude it. This is incredibly ugly, but seems to be about
> + * the only way around it.
> + */
> +#define PROTYPES_H
> +#include <nspr.h>
> +#undef PROTYPES_H

Yuck :(.

> +int
> +be_tls_init(bool isServerStart)
> +{
> + SECStatus status;
> + SSLVersionRange supported_sslver;
> +
> + /*
> + * Set up the connection cache for multi-processing application behavior.

Hm. Do we necessarily want that? Session resumption is not exactly
unproblematic... Or does this do something else?

> + * If we are in ServerStart then we initialize the cache. If the server is
> + * already started, we inherit the cache such that it can be used for
> + * connections. Calling SSL_ConfigMPServerSIDCache sets an environment
> + * variable which contains enough information for the forked child to know
> + * how to access it. Passing NULL to SSL_InheritMPServerSIDCache will
> + * make the forked child look it up by the default name SSL_INHERITANCE,
> + * if env vars aren't inherited then the contents of the variable can be
> + * passed instead.
> + */

Does this stuff work on windows / EXEC_BACKEND?

> + * The below parameters are what the implicit initialization would've done
> + * for us, and should work even for older versions where it might not be
> + * done automatically. The last parameter, maxPTDs, is set to various
> + * values in other codebases, but has been unused since NSPR 2.1 which was
> + * released sometime in 1998.
> + */
> + PR_Init(PR_USER_THREAD, PR_PRIORITY_NORMAL, 0 /* maxPTDs */ );

https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSPR/Reference/PR_Init
says that currently all parameters are ignored?

> + /*
> + * Import the already opened socket as we don't want to use NSPR functions
> + * for opening the network socket due to how the PostgreSQL protocol works
> + * with TLS connections. This function is not part of the NSPR public API,
> + * see the comment at the top of the file for the rationale of still using
> + * it.
> + */
> + pr_fd = PR_ImportTCPSocket(port->sock);
> + if (!pr_fd)
> + ereport(ERROR,
> + (errmsg("unable to connect to socket")));

I don't see the comment you're referring to?

> + /*
> + * Most of the documentation available, and implementations of, NSS/NSPR
> + * use the PR_NewTCPSocket() function here, which has the drawback that it
> + * can only create IPv4 sockets. Instead use PR_OpenTCPSocket() which
> + * copes with IPv6 as well.
> + */
> + model = PR_OpenTCPSocket(port->laddr.addr.ss_family);
> + if (!model)
> + ereport(ERROR,
> + (errmsg("unable to open socket")));
> +
> + /*
> + * Convert the NSPR socket to an SSL socket. Ensuring the success of this
> + * operation is critical as NSS SSL_* functions may return SECSuccess on
> + * the socket even though SSL hasn't been enabled, which introduce a risk
> + * of silent downgrades.
> + */
> + model = SSL_ImportFD(NULL, model);
> + if (!model)
> + ereport(ERROR,
> + (errmsg("unable to enable TLS on socket")));

It's confusing that these functions do not actually reference the socket
via some handle :(. What does opening a socket do here?

> + /*
> + * Configure the allowed cipher. If there are no user preferred suites,

*ciphers?

> +
> + port->pr_fd = SSL_ImportFD(model, pr_fd);
> + if (!port->pr_fd)
> + ereport(ERROR,
> + (errmsg("unable to initialize")));
> +
> + PR_Close(model);

A comment explaining why we first import a NULL into the model, and then
release the model, and import the real fd would be good.

> +ssize_t
> +be_tls_read(Port *port, void *ptr, size_t len, int *waitfor)
> +{
> + ssize_t n_read;
> + PRErrorCode err;
> +
> + n_read = PR_Read(port->pr_fd, ptr, len);
> +
> + if (n_read < 0)
> + {
> + err = PR_GetError();

Yay, more thread global state :(.

> + /* XXX: This logic seems potentially bogus? */
> + if (err == PR_WOULD_BLOCK_ERROR)
> + *waitfor = WL_SOCKET_READABLE;
> + else
> + *waitfor = WL_SOCKET_WRITEABLE;

Don't we need to handle failed connections somewhere here? secure_read()
won't know about PR_GetError() etc? How would SSL errors be signalled
upwards here?

Also, as you XXX, it's not clear to me that your mapping would always
result in waiting for the right event? A tls write could e.g. very well
require receiving data etc?

> + /*
> + * At least one byte with password content was returned, and NSS requires
> + * that we return it allocated in NSS controlled memory. If we fail to
> + * allocate then abort without passing back NULL and bubble up the error
> + * on the PG side.
> + */
> + password = (char *) PR_Malloc(len + 1);
> + if (!password)
> + ereport(ERROR,
> + (errcode(ERRCODE_OUT_OF_MEMORY),
> + errmsg("out of memory")));
>
> + strlcpy(password, buf, sizeof(password));
> + explicit_bzero(buf, sizeof(buf));
> +

In case of error you're not bzero'ing out the password!

Separately, I wonder if we should introduce a function for throwing OOM
errors - which then e.g. could print the memory context stats in those
places too...

> +static SECStatus
> +pg_cert_auth_handler(void *arg, PRFileDesc * fd, PRBool checksig, PRBool isServer)
> +{
> + SECStatus status;
> + Port *port = (Port *) arg;
> + CERTCertificate *cert;
> + char *peer_cn;
> + int len;
> +
> + status = SSL_AuthCertificate(CERT_GetDefaultCertDB(), port->pr_fd, checksig, PR_TRUE);
> + if (status == SECSuccess)
> + {
> + cert = SSL_PeerCertificate(port->pr_fd);
> + len = strlen(cert->subjectName);
> + peer_cn = MemoryContextAllocZero(TopMemoryContext, len + 1);
> + if (strncmp(cert->subjectName, "CN=", 3) == 0)
> + strlcpy(peer_cn, cert->subjectName + strlen("CN="), len + 1);
> + else
> + strlcpy(peer_cn, cert->subjectName, len + 1);
> + CERT_DestroyCertificate(cert);
> +
> + port->peer_cn = peer_cn;
> + port->peer_cert_valid = true;

Hm. We either should have something similar to

/*
* Reject embedded NULLs in certificate common name to prevent
* attacks like CVE-2009-4034.
*/
if (len != strlen(peer_cn))
{
ereport(COMMERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg("SSL certificate's common name contains embedded null")));
pfree(peer_cn);
return -1;
}
here, or a comment explaining why not.

Also, what's up with the CN= bit? Why is that needed here, but not for
openssl?

> +static PRFileDesc *
> +init_iolayer(Port *port, int loglevel)
> +{
> + const PRIOMethods *default_methods;
> + PRFileDesc *layer;
> +
> + /*
> + * Start by initializing our layer with all the default methods so that we
> + * can selectively override the ones we want while still ensuring that we
> + * have a complete layer specification.
> + */
> + default_methods = PR_GetDefaultIOMethods();
> + memcpy(&pr_iomethods, default_methods, sizeof(PRIOMethods));
> +
> + pr_iomethods.recv = pg_ssl_read;
> + pr_iomethods.send = pg_ssl_write;
> +
> + /*
> + * Each IO layer must be identified by a unique name, where uniqueness is
> + * per connection. Each connection in a postgres cluster can generate the
> + * identity from the same string as they will create their IO layers on
> + * different sockets. Only one layer per socket can have the same name.
> + */
> + pr_id = PR_GetUniqueIdentity("PostgreSQL");

Seems like it might not be a bad idea to append Server or something?

> +
> + /*
> + * Create the actual IO layer as a stub such that it can be pushed onto
> + * the layer stack. The step via a stub is required as we define custom
> + * callbacks.
> + */
> + layer = PR_CreateIOLayerStub(pr_id, &pr_iomethods);
> + if (!layer)
> + {
> + ereport(loglevel,
> + (errmsg("unable to create NSS I/O layer")));
> + return NULL;
> + }

Why is this accepting a variable log level? The only caller passes ERROR?

> +/*
> + * pg_SSLerrmessage
> + * Create and return a human readable error message given
> + * the specified error code
> + *
> + * PR_ErrorToName only converts the enum identifier of the error to string,
> + * but that can be quite useful for debugging (and in case PR_ErrorToString is
> + * unable to render a message then we at least have something).
> + */
> +static char *
> +pg_SSLerrmessage(PRErrorCode errcode)
> +{
> + char error[128];
> + int ret;
> +
> + /* TODO: this should perhaps use a StringInfo instead.. */
> + ret = pg_snprintf(error, sizeof(error), "%s (%s)",
> + PR_ErrorToString(errcode, PR_LANGUAGE_I_DEFAULT),
> + PR_ErrorToName(errcode));
> + if (ret)
> + return pstrdup(error);

> + return pstrdup(_("unknown TLS error"));
> +}

Why not use psrintf() here?

> +++ b/src/include/common/pg_nss.h
> @@ -0,0 +1,141 @@
> +/*-------------------------------------------------------------------------
> + *
> + * pg_nss.h
> + * Support for NSS as a TLS backend
> + *
> + * These definitions are used by both frontend and backend code.
> + *
> + * Copyright (c) 2020, PostgreSQL Global Development Group
> + *
> + * IDENTIFICATION
> + * src/include/common/pg_nss.h
> + *
> + *-------------------------------------------------------------------------
> + */
> +#ifndef PG_NSS_H
> +#define PG_NSS_H
> +
> +#ifdef USE_NSS
> +
> +#include <sslproto.h>
> +
> +PRUint16 pg_find_cipher(char *name);
> +
> +typedef struct
> +{
> + const char *name;
> + PRUint16 number;
> +} NSSCiphers;
> +
> +#define INVALID_CIPHER 0xFFFF
> +
> +/*
> + * This list is a partial copy of the ciphers in NSS files lib/ssl/sslproto.h
> + * in order to provide a human readable version of the ciphers. It would be
> + * nice to not have to have this, but NSS doesn't provide any API addressing
> + * the ciphers by name. TODO: do we want more of the ciphers, or perhaps less?
> + */
> +static const NSSCiphers NSS_CipherList[] = {
> +
> + {"TLS_NULL_WITH_NULL_NULL", TLS_NULL_WITH_NULL_NULL},

Hm. Is this whole business of defining array constants in a header just
done to avoid having a .c file that needs to be compiled both in
frontend and backend code?

> +/*
> + * The nspr/obsolete/protypes.h NSPR header typedefs uint64 and int64 with
> + * colliding definitions from ours, causing a much expected compiler error.
> + * The definitions are however not actually used in NSPR at all, and are only
> + * intended for what seems to be backwards compatibility for apps written
> + * against old versions of NSPR. The following comment is in the referenced
> + * file, and was added in 1998:
> + *
> + * This section typedefs the old 'native' types to the new PR<type>s.
> + * These definitions are scheduled to be eliminated at the earliest
> + * possible time. The NSPR API is implemented and documented using
> + * the new definitions.
> + *
> + * As there is no opt-out from pulling in these typedefs, we define the guard
> + * for the file to exclude it. This is incredibly ugly, but seems to be about
> + * the only way around it.
> + */

There's a lot of duplicated comments here. Could we move either of the
files to reference the other for longer ones?

> +/*
> + * PR_ImportTCPSocket() is a private API, but very widely used, as it's the
> + * only way to make NSS use an already set up POSIX file descriptor rather
> + * than opening one itself. To quote the NSS documentation:
> + *
> + * "In theory, code that uses PR_ImportTCPSocket may break when NSPR's
> + * implementation changes. In practice, this is unlikely to happen because
> + * NSPR's implementation has been stable for years and because of NSPR's
> + * strong commitment to backward compatibility."
> + *
> + * https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSPR/Reference/PR_ImportTCPSocket
> + *
> + * The function is declared in <private/pprio.h>, but as it is a header marked
> + * private we declare it here rather than including it.
> + */
> +NSPR_API(PRFileDesc *) PR_ImportTCPSocket(int);

Ugh. This is really the way to do this? How do other applications deal
with this problem?

> +#if defined(WIN32)
> +static const char *ca_trust_name = "nssckbi.dll";
> +#elif defined(__darwin__)
> +static const char *ca_trust_name = "libnssckbi.dylib";
> +#else
> +static const char *ca_trust_name = "libnssckbi.so";
> +#endif

There's really no pre-existing handling for this in nss???

> + /*
> + * The original design of NSS was for a single application to use a single
> + * copy of it, initialized with NSS_Initialize() which isn't returning any
> + * handle with which to refer to NSS. NSS initialization and shutdown are
> + * global for the application, so a shutdown in another NSS enabled
> + * library would cause NSS to be stopped for libpq as well. The fix has
> + * been to introduce NSS_InitContext which returns a context handle to
> + * pass to NSS_ShutdownContext. NSS_InitContext was introduced in NSS
> + * 3.12, but the use of it is not very well documented.
> + * https://bugzilla.redhat.com/show_bug.cgi?id=738456
> + *
> + * The InitParameters struct passed can be used to override internal
> + * values in NSS, but the usage is not documented at all. When using
> + * NSS_Init initializations, the values are instead set via PK11_Configure
> + * calls so the PK11_Configure documentation can be used to glean some
> + * details on these.
> + *
> + * https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSS/PKCS11/Module_Specs

> +
> + if (!nss_context)
> + {
> + char *err = pg_SSLerrmessage(PR_GetError());
> +
> + printfPQExpBuffer(&conn->errorMessage,
> + libpq_gettext("unable to %s certificate database: %s"),
> + conn->cert_database ? "open" : "create",
> + err);
> + free(err);
> + return PGRES_POLLING_FAILED;
> + }
> +
> + /*
> + * Configure cipher policy.
> + */
> + status = NSS_SetDomesticPolicy();

Why is "domestic" the right thing here?

> +
> + PK11_SetPasswordFunc(PQssl_passwd_cb);

Is it actually OK to do stuff like this when other users of NSS might be
present? That's obviously more likely in the libpq case, compared to the
backend case (where it's also possible, of course). What prevents us
from overriding another user's callback?

> +ssize_t
> +pgtls_read(PGconn *conn, void *ptr, size_t len)
> +{
> + PRInt32 nread;
> + PRErrorCode status;
> + int read_errno = 0;
> +
> + nread = PR_Recv(conn->pr_fd, ptr, len, 0, PR_INTERVAL_NO_WAIT);
> +
> + /*
> + * PR_Recv blocks until there is data to read or the timeout expires. Zero
> + * is returned for closed connections, while -1 indicates an error within
> + * the ongoing connection.
> + */
> + if (nread == 0)
> + {
> + read_errno = ECONNRESET;
> + return -1;
> + }

It's a bit confusing to talk about blocking when the socket presumably
is in non-blocking mode, and you're also asking to never wait?

> + if (nread == -1)
> + {
> + status = PR_GetError();
> +
> + switch (status)
> + {
> + case PR_WOULD_BLOCK_ERROR:
> + read_errno = EINTR;
> + break;

Uh, isn't this going to cause a busy-loop by the caller? EINTR isn't the
same as EAGAIN/EWOULDBLOCK?

> + case PR_IO_TIMEOUT_ERROR:
> + break;

What does this mean? We'll return with a 0 errno here, right? When is
this case reachable?

E.g. the comment in fe-misc.c:
/* pqsecure_read set the error message for us */
for this case doesn't seem to be fulfilled by this.

> +/*
> + * Verify that the server certificate matches the hostname we connected to.
> + *
> + * The certificate's Common Name and Subject Alternative Names are considered.
> + */
> +int
> +pgtls_verify_peer_name_matches_certificate_guts(PGconn *conn,
> + int *names_examined,
> + char **first_name)
> +{
> + return 1;
> +}

Uh, huh? Certainly doesn't verify anything...

> +/* ------------------------------------------------------------ */
> +/* PostgreSQL specific TLS support functions */
> +/* ------------------------------------------------------------ */
> +
> +/*
> + * TODO: this a 99% copy of the same function in the backend, make these share
> + * a single implementation instead.
> + */
> +static char *
> +pg_SSLerrmessage(PRErrorCode errcode)
> +{
> + const char *error;
> +
> + error = PR_ErrorToName(errcode);
> + if (error)
> + return strdup(error);
> +
> + return strdup("unknown TLS error");
> +}

Btw, why does this need to duplicate strings, instead of returning a
const char*?

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2020-10-20 19:24:35 Re: [PATCH] SET search_path += octopus
Previous Message Alvaro Herrera 2020-10-20 19:04:20 Re: CREATE TABLE .. PARTITION OF fails to preserve tgenabled for inherited row triggers