Re: Internal key management system

From: Craig Ringer <craig(dot)ringer(at)enterprisedb(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Cary Huang <cary(dot)huang(at)highgo(dot)ca>, Ahsan Hadi <ahsan(dot)hadi(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Moon, Insung" <tsukiwamoon(dot)pgsql(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Sehrope Sarkuni <sehrope(at)jackdb(dot)com>, cary huang <hcary328(at)gmail(dot)com>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, Joe Conway <mail(at)joeconway(dot)com>
Subject: Re: Internal key management system
Date: 2020-10-27 07:07:22
Message-ID: CAGRY4nxCtetFG6wYThVGZXHNhhDDyTqmhDF6oos_-myT3txRGg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Oct 26, 2020 at 11:02 PM Stephen Frost <sfrost(at)snowman(dot)net> wrote:

TL;DR:

* Important to check that key rotation is possible on a replica, i.e.
primary and standby can have different cluster passphrase and KEK
encrypting the same WAL and heap keys;
* with a HSM we can't read the key out, so a pluggable KEK operations
context or a configurable URI for the KEK is necessary
* I want the SQL key and SQL wrap/unwrap part in a separate patch, I
don't think it's fully baked and oppose its inclusion in its current
form
* Otherwise this looks good so far

Explanation and argument for why below.

> I've not been following very closely, but I definitely agree with the
> general feedback here (more on that below), but to this point- I do
> believe that was the intent, or at least I sure hope that it was. Being
> able to have user/role keys would certainly be good. Having a way for a
> user to log in and unlock their key would also be really nice.

Right. AFAICS this is supposed to provide the foundation layer for
whole-cluster encryption, and it looks ok for that, caveat about HSMs
aside. I see nothing wrong with using a single key for heap (and one
for WAL, or even the same key). Finer grained and autovacuum etc
becomes seriously painful.

I want to take a closer look at how the current implementation will
play with physical replication. I assume the WAL and heap keys have to
be constant for the full cluster lifetime, short of a dump and reload.
But I want to make sure that the KEK+HMAC can differ from one node to
another, i.e. that we can perform KEK rotation on a replica to
re-encrypt the WAL and heap keys against a new KEK. This is important
for backups, and also for effective use of a HSM where we may not want
or be able to have the same key on a primary and its replicas. If it
isn't already supported it looks like it should be simple, but it's
IMO important not to miss.

The main issue I have so far is that I don't think the SQL key
actually fits well with the current proposal. Its proposed interface
and use cases are incomplete, it doesn't fully address key leak risks,
there's no user access control, etc. Also the SQL key part could be
implemented on top of the base cluster encryption part, I don't see
why it actually has to integrate with the whole-cluster key management
directly.

SQL KEY
----

I'm not against the SQL key and wrap/unwrap functionality - quite the
contrary, I think it's really important to have something like it. But
is it appropriate to have a single, fixed-for-cluster-lifetime key for
this, one with no SQL-level access control over who can or cannot use
it, etc? The material encrypted with the key is user-exposed so key
rotation is an issue, but is not addressed here. And the interface
doesn't really solve the numerous potential problems with key material
leaks through logs and error messages.

I just think that if we bake in the proposed user visible wrap/unwrap
interface now we're going to regret it later. How will it work when we
want to add user- or role- level access control for database-stored
keys? When we want to introduce a C-level API for extensions to work
directly with encrypted data like they can work currently with TOASTed
data, to prevent decrypted data from ever becoming SQL function
arguments subject to possible leakage and to allow implementation of
always-encrypted data types, etc?

Most importantly - I don't think the SQL key adds anything really
crucial that we cannot do at the SQL level with an extension. An
extension "pg_wrap" could provide pg_wrap() and pg_unwrap() already,
using a single master key much like the SQL key proposed in this
patch. To store the master key it could:

1. Derive the key at shared_preload_libraries time in the same way
this key management system proposes to do so, using a
pg_wrap.pg_wrap_passphrase_command ; or
2. Read the key from a PEM file on disk, accepting a passphrase to
decrypt it via a password command GUC or an SQL-level function call;
or
3. Read the key from a pg_wrap.pg_wrap_keys extension catalog, which
is superuser-only and which is protected by whole-db encryption if in
use;
4. Like (3) but use generic WAL and a custom relation to make the
in-db key store opaque without use of extensions like pageinspect.

That way we haven't baked some sort of limited wrap/unwrap into Pg's
long term user visible API. I'd be totally happy for such a SQL key
wrap/unwrap to become part of pgcrypto, or a separate extension that
uses pgcrypto, if you're worried about having it available to users. I
just don't really want it in src/backend in its current form.

To simplify (1) we could make the implementation of the KEK/HMAC
derivation accessible from extensions and allow them to provide their
own password callback, though that might make life harder if we want
to change things later, and it'd mean that you couldn't use the
extension on a db that was not configured for whole db encryption. So
I'd actually rather make key derivation and storage the extension's
problem for now.

OTHER TRANSPARENT ENCRYPTION USE CASES
----

Does this patch get in the way of supporting other kinds of
transparent encryption that are frequently requested and are in use on
other systems already?

I don't think so. Whole-cluster encryption is quite separate and the
proposed patch doesn't seem to do anything that'd make table-, row- or
column-level encryption, per-user key management, etc any harder.

Specific use cases I looked at:

* Finer grained keying than whole-cluster for transparent
encryption-at-rest. As soon as we have relations that require user
session supplied information to allow the backend to read the relation
we get into a real mess with autovacuum, logical decoding, etc. So if
anyone wants to implement that sorts of thing they're probably going
to want to do so separately to block-level whole-cluster encryption,
in a way that preserves the normal page and page item structure and
encrypts the row data only.

* Client-driver-assisted transparently encrypted
at-rest-and-in-transit data, where the database engine doesn't have
the encrypt/decrypt keys at all. Again in this case they're going to
have to do that at the row level or column level, not the block
(relfilenode extents and WAL) level, otherwise we can't provide
autovacuum etc.

> > If so, what if I have a HSM [...]
>
> [...] I agree that it'd
> be better to offer an option that can go to the HSM directly.

Right.

> That
> said- I don't think we necessarily want to throw out tho command-based
> option, as users may wish to use a vaulting solution or similar instead
> of an HSM.

I agree. I wasn't proposing to throw out the command based approach,
just provide a way to inform postgres that it should do operations
with the KEK using an external engine instead of deriving its own KEK
from a passphrase and other inputs.

> What I am curious about though- what are the thoughts around
> using a vaulting solution's command-line tool vs. writing code to work
> with an API?

I think the code that fetches the cluster passphrase from a command
should be interceptable by a hook, so organisations with Wacky
Security Policies Written By People Who Have Heard About Computers But
Never Used One can jump through the necessary hoops. I am of course
absolutely not speaking from experience here, no, not at all... see
ssl_passphrase_function in src/backend/libpq/be-secure-openssl.c, and
see src/test/modules/ssl_passphrase_callback/ssl_passphrase_func.c .

So I suggest something like that - a hook that by default calls an
external command but can by overridden by an extension. It wouldn't be
openssl specific like the server key passphrase example though. That
was done with an openssl specific hook because we don't know if we're
going to need a passphrase at all until openssl has opened the key. In
the cluster encryption case we'll know if we're doing our own KEK+HMAC
generation or not without having to ask the SSL library.

> Between these various options, what are the risks of
> having a script vs. using an API and would one or the other weaken the
> overall solution? Or is what's really needed here is a way to tell us
> if it's a passphrase we're getting or a proper key, regardless of the
> method being used to fetch it?

For various vault systems I don't think it matters at all whether the
secret they manage is the key, input used to generate the key, or
input used to decrypt a key stored elsewhere. Either way they have the
pivotal secret. So I don't see much point allowing the command to
return a fully formed key.

The point of a HSM that you don't get to read the key. Pg can never
read the key, it can only perform encrypt and decrypt operations on
the key using the HSM via the SSL library:

Pg -> openssl:
"this is the ciphertext of the wal_key. Please decrypt it for me."
openssl -> engine layer
"engine, please decrypt this"
pkcs#11 engine-> pkcs#11 provider:
"please decrypt this"
pkcs#11 provider -> HSM-specific libraries, network proxies, whatever:
"please decrypt this"
"... here's the plaintext"
<- flows back up

So the KEK used to encrypt the main cluster keys for heap and wal
encryption is never readable by Pg. It usually never enters host
memory - in the case of a HSM, the ciphertext is sent over USB or PCIe
to the HSM and the cleartext comes back.

In openssl, the default engine is file-based with host software crypto
implementations. You can specify alternate engines using various
OpenSSL APIs, or you can specify them by supplying a URI where you'd
usually supply a file path to a key.

I'm proposing we make it easy to supply a key URI and let openssl
handle the engine etc. It's far from perfect, and it's really meant as
a fallback to allow apps that don't natively understand SSL engines
etc to still use them in a limited capacity.

What I'd *prefer* to do is make the function that sets up the KEK
hookable. So by default we'd call a function that'd read the external
passphrase from a command use that to generate KEK+HMAC. But an
extension hook installed at shared_preload_libraries time could
override the behaviour completely and return its own implementation.

In the phrasing of the patch as written that'd probably be a hook that
returns its own pg_cipher_ctx* , overriding the default
implementation of pg_cipher_ctx * pg_cipher_ctx_create(void); . For a
HSM the returned context would delegate operations to the HSM.

> This really locks us into OpenSSL for this, which I don't particularly
> like.

We're pretty locked into openssl already. I don't like it either, it
was just the option that has the least impact/delay on the main work
on this patch.

I'd rather abstract KEK operations behind a context object-like struct
with function pointer members, like we do in many other places in Pg.
Make the default one do the dance of reading the external passphrase
and generating the KEK on the fly. Allow plugins to override it with
their own, and let them set it up to delegate to a HSM or whatever
else they want.

Then ship a simple openssl based default implementation of HSM support
that can be shoved in shared_preload_libraries. Or if we don't like
using s_p_l, add a separate GUC for cluster_encryption_key_manager or
whatever, and a different entrypoint, instead of having s_p_l call
_PG_init() to register a hook.

> > For example if I want to lock my database with a YubiHSM I would configure
> > something like:
> >
> > cluster_encryption_key = 'pkcs11:token=YubiHSM;id=0:0001;type=private'
> >
> > The DB would be encrypted and decrypted using application keys unlocked by
> > the HSM. Backups of the database, stolen disk images, etc, would be
> > unreadable unless you have access to another HSM with the same key loaded.
>
> Well, you would surely just need the key, since you could change the PG
> config to fetch the key from wherever you have it, you wouldn't need an
> actual HSM.

Right - if your HSM was programmed by generating a key and storing
that into the HSM and you have that key backed up in file form
somewhere, you could likely put it in a pem file and use that directly
by pointing Pg at the file instead of an engine URI.

But you might not even have the key. In some HSM implementations the
key is completely sealed - you can program new HSMs to have the same
key by using the same configuration, but you cannot actually obtain
the key short of attacks on the HSM hardware itself. That's very much
by design - the HSM configuration is usually on an air-gapped system,
and it isn't sufficient to decrypt anything unless you also have
access to a copy of the HSM hardware itself. Obviously you accept the
risks if you take that approach, and you must have an escape route
where you can re-encrypt the material protected by the HSM against
some other key. But it's not at all uncommon.

Key rotation is obviously vital to make this vaguely sane. In Pg's
case you'd to change the key configuration, then trigger a key
rotation step, which would decrypt with a context obtained from the
old config then encrypt with a context obtained from the new config.

> > If cluster_encryption_key is unset, Pg would perform its own KEK derivation
> > based on cluster_passphrase_command as currently implemented.
>
> To what I was suggesting above- what if we just had a GUC that's
> "kek_method" with options 'passphrase' and 'direct', where passphrase
> goes through KEK and 'direct' doesn't, which just changes how we treat
> the results of called cluster_passphrase_command?

That won't work for a HSM. It is not possible to extract the key.
"direct" cannot be implemented.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2020-10-27 07:07:46 Re: Online checksums verification in the backend
Previous Message Amit Langote 2020-10-27 06:39:53 Re: Parallel Append can break run-time partition pruning