Re: Transparent column encryption

From: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
To: Jacob Champion <jchampion(at)timescale(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Transparent column encryption
Date: 2022-07-18 10:53:23
Message-ID: c3d59b70-cf22-a4bf-4613-91d040b988bd@enterprisedb.com
Lists: pgsql-hackers

On 15.07.22 19:47, Jacob Champion wrote:
>> The CEK key
>> material is in turn encrypted by an asymmetric key called the column
>> master key (CMK).
>
> I'm not yet understanding why the CMK is asymmetric.

I'm not totally sure either. I started to build it that way because
other systems were doing it that way, too. But I have been thinking
about adding a symmetric alternative for the CMKs as well (probably AESKW).
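
For illustration, wrapping a CEK under a symmetric CMK with AESKW would
amount to something like this (just a sketch in Python using the
"cryptography" package, not anything from the patch):

    import os
    from cryptography.hazmat.primitives.keywrap import aes_key_wrap, aes_key_unwrap

    cmk = os.urandom(32)   # hypothetical symmetric column master key
    cek = os.urandom(32)   # column encryption key to be protected

    # The wrapped CEK is what would be stored on the server; the client
    # unwraps it at run time with the CMK it holds locally.
    wrapped = aes_key_wrap(cmk, cek)
    assert aes_key_unwrap(cmk, wrapped) == cek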

I think there are a couple of reasons why asymmetric keys are possibly
useful for CMKs:

Some other products make use of secure enclaves to do computations on
(otherwise) encrypted values on the server. I don't fully know how that
works, but I suspect that asymmetric keys can play a role in that. (I
don't have any immediate plans for that in my patch. It seems to be a
dying technology at the moment.)

Asymmetric keys give you some more options for how you set up the keys
at the beginning. For example, you create the asymmetric key pair on
the host where your client program that wants access to the encrypted
data will run. You put the private key in an appropriate location for
run time. You send the public key to another host. On that other host,
you create the CEK, encrypt it with the CMK, and then upload it into the
server (CREATE COLUMN ENCRYPTION KEY). Then you can wipe that second
host. That way, you can be even more sure that the unencrypted CEK
isn't left anywhere. I'm not sure whether this method is very useful in
practice, but it's interesting.
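
Expressed as code, the cryptographic part of that flow would be roughly
the following (a sketch using Python's "cryptography" package; the key
sizes and parameter choices are just for illustration, not from the
patch):

    import os
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import rsa, padding

    # On the client host: generate the CMK key pair, keep the private key.
    private_key = rsa.generate_private_key(public_exponent=65537, key_size=3072)
    public_pem = private_key.public_key().public_bytes(
        serialization.Encoding.PEM,
        serialization.PublicFormat.SubjectPublicKeyInfo)

    # On the (disposable) second host: create the CEK and encrypt it with
    # the public half of the CMK.
    public_key = serialization.load_pem_public_key(public_pem)
    cek = os.urandom(32)
    wrapped_cek = public_key.encrypt(
        cek,
        padding.OAEP(mgf=padding.MGF1(hashes.SHA256()),
                     algorithm=hashes.SHA256(), label=None))

    # wrapped_cek is what gets uploaded with CREATE COLUMN ENCRYPTION KEY;
    # after that, the plaintext cek can be wiped together with the host.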

In any case, as I mentioned above, this particular aspect is up for
discussion.

Also note that if you use a KMS (cmklookup "run" method), the actual
algorithm doesn't even matter (depending on details of the KMS setup),
since you just tell the KMS "decrypt this", and the KMS knows by itself
what algorithm to use. Maybe there should be a way to specify "unknown"
in the ckdcmkalg field.

>> +#define PG_CEK_AEAD_AES_128_CBC_HMAC_SHA_256 130
>> +#define PG_CEK_AEAD_AES_192_CBC_HMAC_SHA_384 131
>> +#define PG_CEK_AEAD_AES_256_CBC_HMAC_SHA_384 132
>> +#define PG_CEK_AEAD_AES_256_CBC_HMAC_SHA_512 133
>
> It looks like these ciphersuites were abandoned by the IETF. Are there
> existing implementations of them that have been audited/analyzed? Are
> they safe (and do we know that the claims made in the draft are
> correct)? How do they compare to other constructions like AES-GCM-SIV
> and XChacha20-Poly1305?

The short answer is, these same algorithms are used in equivalent
products (see MS SQL Server, MongoDB). They even reference the exact
same draft document.

Besides that, here is my analysis of why these are good choices: You
can't use any of the counter modes, because the encryption happens on
the client and there is no way to coordinate between clients to avoid
nonce reuse. So
among mainstream modes, you are basically left with AES-CBC with a
random IV. In that case, even if you happen to reuse an IV, the
possible damage is very contained.

And then, if you want AEAD, you combine that with some MAC, and HMAC
is as good a choice as any for that.

The referenced draft document doesn't really contain any additional
cryptographic insights; it's just a guide to a particular way of
putting these two together.

So altogether I think this is a pretty solid choice.
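
For reference, the draft's construction is plain encrypt-then-MAC.  My
rough reading of the AEAD_AES_128_CBC_HMAC_SHA_256 variant, as a Python
sketch (not the patch code, and glossing over how IV, ciphertext, and
tag are framed in the stored value):

    import os, hmac, hashlib
    from cryptography.hazmat.primitives import padding
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def aead_aes_128_cbc_hmac_sha_256_encrypt(key, ad, plaintext):
        # 32-byte key: the initial 16 bytes are the MAC key, the final
        # 16 bytes are the AES-CBC key.
        assert len(key) == 32
        mac_key, enc_key = key[:16], key[16:]

        iv = os.urandom(16)
        padder = padding.PKCS7(128).padder()
        padded = padder.update(plaintext) + padder.finalize()
        enc = Cipher(algorithms.AES(enc_key), modes.CBC(iv)).encryptor()
        ciphertext = enc.update(padded) + enc.finalize()

        # The MAC covers the associated data, the IV, the ciphertext, and
        # the bit length of the associated data as a 64-bit big-endian
        # integer; the tag is the HMAC output truncated to 16 bytes.
        al = (8 * len(ad)).to_bytes(8, "big")
        tag = hmac.new(mac_key, ad + iv + ciphertext + al,
                       hashlib.sha256).digest()[:16]
        return iv + ciphertext + tag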

>> +-- \gencr
>> +-- (This just tests the parameter passing; there is no encryption here.)
>> +CREATE TABLE test_gencr (a int, b text);
>> +INSERT INTO test_gencr VALUES (1, 'one') \gencr
>> +SELECT * FROM test_gencr WHERE a = 1 \gencr
>> + a | b
>> +---+-----
>> + 1 | one
>> +(1 row)
>> +
>> +INSERT INTO test_gencr VALUES ($1, $2) \gencr 2 'two'
>> +SELECT * FROM test_gencr WHERE a IN ($1, $2) \gencr 2 3
>> + a | b
>> +---+-----
>> + 2 | two
>> +(1 row)
> I'd expect \gencr to error out without sending plaintext. I know that
> under the hood this is just setting up a prepared statement, but if I'm
> using \gencr, presumably I really do want to be encrypting my data.
> Would it be a problem to always set force-column-encryption for the
> parameters we're given here? Any unencrypted columns could be provided
> directly.

Yeah, this needs a bit of refinement. You don't want something named
"encr" that only encrypts some of the time. We could possibly do what
you suggest and make it set the force-encryption flag, or maybe rename
it, or add another command that just uses prepared statements and whose
name doesn't promise anything about encryption.

This also ties in with how pg_dump will eventually work. I think by
default pg_dump will just dump things encrypted and set it up so that
COPY writes it back encrypted. But there should probably be a mode that
dumps out plaintext and then uses one of these commands to load the
plaintext back in. What these psql commands need to do also depends on
what pg_dump needs them to do.

>> + <para>
>> + Null values are not encrypted by transparent column encryption; null values
>> + sent by the client are visible as null values in the database. If the fact
>> + that a value is null needs to be hidden from the server, this information
>> + needs to be encoded into a nonnull value in the client somehow.
>> + </para>
>
> This is a major gap, IMO. Especially with the switch to authenticated
> ciphers, because it means you can't sign your NULL values. And having
> each client or user that's out there solve this with a magic in-band
> value seems like a recipe for pain.
>
> Since we're requiring "canonical" use of text format, and the docs say
> there are no embedded or trailing nulls allowed in text values, could we
> steal the use of a single zero byte to mean NULL? One additional
> complication would be that the client would have to double-check that
> we're not writing a NULL into a NOT NULL column, and complain if it
> reads one during decryption. Another complication would be that the
> client would need to complain if it got a plaintext NULL.

You're already alluding to some of the complications. Also consider
that null values could arise from, say, outer joins. So you could be in
a situation where encrypted and unencrypted null values coexist. And of
course the server doesn't know about the encrypted null values. So how
do you maintain semantics, like for aggregate functions, primary keys,
anything that treats null values specially? How do clients deal with a
mix of encrypted and unencrypted null values, and how do they know
which one is real? What if the client needs to send a null value back as a
parameter? All of this would create enormous complications, if they can
be solved at all.

I think a way to look at this is that this column encryption feature
isn't suitable for disguising the existence or absence of data; it can
only disguise the particular data that you know exists.

>> + <para>
>> + The <quote>associated data</quote> in these algorithms consists of 4
>> + bytes: The ASCII letters <literal>P</literal> and <literal>G</literal>
>> + (byte values 80 and 71), followed by the algorithm ID as a 16-bit unsigned
>> + integer in network byte order.
>> + </para>
>
> Is this AD intended as a placeholder for the future, or does it serve a
> particular purpose?

It has been recommended that you include the identity of the
encryption algorithm in the AD. This protects the client from
decrypting data that wasn't meant to be decrypted with that algorithm.
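
So concretely, using the algorithm IDs quoted above, the associated
data for AEAD_AES_128_CBC_HMAC_SHA_256 (ID 130) is just these four
bytes:

    import struct

    # ASCII 'P' and 'G' (80, 71), then the algorithm ID as a 16-bit
    # unsigned integer in network byte order.
    ad = struct.pack("!2sH", b"PG", 130)
    assert ad == b"\x50\x47\x00\x82"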
