Re: Transparent column encryption

From: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Transparent column encryption
Date: 2022-07-05 10:54:13
Message-ID: 79f08a39-a7da-5157-cef4-378fb60c18f8@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Rebased patch, no new functionality.

On 29.06.22 01:29, Peter Eisentraut wrote:
> Here is a new version of this patch.  See also the original description
> quoted below.  I have done a significant amount of work on this over the
> last few months.  Some important news include:
>
> - The cryptography has been improved.  It now uses an AEAD scheme, and
> for deterministic encryption a proper SIV construction.
>
> - The OpenSSL-specific parts have been moved to a separate file in
> libpq.  Non-OpenSSL builds compile and work (without functionality, of
> course).
>
> - libpq handles multiple CEKs and CMKs, including changing keys on the fly.
>
> - libpq supports a mode to force encryption of certain values.
>
> - libpq supports a flexible configuration system for looking up CMKs,
> including support for external key management systems.
>
> - psql has a new \gencr command that allows passing in bind parameters
> for (potential) encryption.
>
> - There is some more pg_dump and psql support.
>
> - The new data types for storing encrypted data have been renamed for
> clarity.
>
> - Various changes to the protocol compared to the previous patch.
>
> - The patch contains full documentation of the protocol changes,
> glossary entries, and more new documentation.
>
> The major pieces that are still missing are:
>
> - DDL support for registering keys
>
> - Protocol versioning or feature flags
>
> Other than that it's pretty complete in my mind.
>
> For interested reviewers, I have organized the patch so that you can
> start reading it top to bottom: The documentation comes first, then the
> tests, then the code changes.  Even some feedback on the first or first
> two aspects would be valuable to me.
>
> Old news follows:
>
> On 03.12.21 22:32, Peter Eisentraut wrote:
>> I want to present my proof-of-concept patch for the transparent column
>> encryption feature.  (Some might also think of it as automatic
>> client-side encryption or similar, but I like my name.)  This feature
>> enables the {automatic,transparent} encryption and decryption of
>> particular columns in the client.  The data for those columns then
>> only ever appears in ciphertext on the server, so it is protected from
>> the "prying eyes" of DBAs, sysadmins, cloud operators, etc.  The
>> canonical use case for this feature is storing credit card numbers
>> encrypted, in accordance with PCI DSS, as well as similar situations
>> involving social security numbers etc.  Of course, you can't do any
>> computations with encrypted values on the server, but for these use
>> cases, that is not necessary.  This feature does support deterministic
>> encryption as an alternative to the default randomized encryption, so
>> in that mode you can do equality lookups, at the cost of some
>> security.
>>
>> This functionality also exists in other SQL database products, so the
>> overall concepts weren't invented by me by any means.
>>
>> Also, this feature has nothing to do with the on-disk encryption
>> feature being contemplated in parallel.  Both can exist independently.
>>
>> The attached patch has all the necessary pieces in place to make this
>> work, so you can have an idea how the overall system works.  It
>> contains some documentation and tests to help illustrate the
>> functionality.  But it's missing the remaining 90% of the work,
>> including additional DDL support, error handling, robust memory
>> management, protocol versioning, forward and backward compatibility,
>> pg_dump support, psql \d support, refinement of the cryptography, and
>> so on.  But I think obvious solutions exist to all of those things, so
>> it isn't that interesting to focus on them for now.
>>
>> ------
>>
>> Now to the explanation of how it works.
>>
>> You declare a column as encrypted in a CREATE TABLE statement.  The
>> column value is encrypted by a symmetric key called the column
>> encryption key (CEK).  The CEK is a catalog object.  The CEK key
>> material is in turn encrypted by an assymmetric key called the column
>> master key (CMK).  The CMK is not stored in the database but somewhere
>> where the client can get to it, for example in a file or in a key
>> management system.  When a server sends rows containing encrypted
>> column values to the client, it first sends the required CMK and CEK
>> information (new protocol messages), which the client needs to record.
>> Then, the client can use this information to automatically decrypt the
>> incoming row data and forward it in plaintext to the application.
>>
>> For the CMKs, the catalog object specifies a "provider" and generic
>> options.  Right now, libpq has a "file" provider hardcoded, and it
>> takes a "filename" option.  Via some mechanism to be determined,
>> additional providers could be loaded and then talk to key management
>> systems via http or whatever.  I have left some comments in the libpq
>> code where the hook points for this could be.
>>
>> The general idea would be for an application to have one CMK per area
>> of secret stuff, for example, for credit card data.  The CMK can be
>> rotated: each CEK can be represented multiple times in the database,
>> encrypted by a different CMK.  (The CEK can't be rotated easily, since
>> that would require reading out all the data from a table/column and
>> reencrypting it.  We could/should add some custom tooling for that,
>> but it wouldn't be a routine operation.)
>>
>> The encryption algorithms are mostly hardcoded right now, but there
>> are facilities for picking algorithms and adding new ones that will be
>> expanded.  The CMK process uses RSA-OAEP.  The CEK process uses
>> AES-128-CBC right now; a more complete solution should probably
>> involve some HMAC thrown in.
>>
>> In the server, the encrypted datums are stored in types called
>> encryptedr and encryptedd (for randomized and deterministic
>> encryption).  These are essentially cousins of bytea.  For the rest of
>> the database system below the protocol handling, there is nothing
>> special about those.  For example, encryptedr has no operators at all,
>> encryptedd has only an equality operator.  pg_attribute has a new
>> column attrealtypid that stores the original type of the data in the
>> column.  This is only used for providing it to clients, so that
>> higher-level clients can convert the decrypted value to their
>> appropriate data types in their environments.
>>
>> Some protocol extensions are required.  These should be guarded by
>> some _pq_... setting, but this is not done in this patch yet.  As
>> mentioned above, extra messages are added for sending the CMKs and
>> CEKs.  In the RowDescription message, I have commandeered the format
>> field to add a bit that indicates that the field is encrypted.  This
>> could be made a separate field, and there should probably be
>> additional fields to indicate the algorithm and CEK name, but this was
>> easiest for now.  The ParameterDescription message is extended to
>> contain format fields for each parameter, for the same purpose.
>> Again, this could be done differently.
>>
>> Speaking of parameter descriptions, the trickiest part of this whole
>> thing appears to be how to get transparently encrypted data into the
>> database (as opposed to reading it out).  It is required to use
>> protocol-level prepared statements (i.e., extended query) for this.
>> The client must first prepare a statement, then describe the statement
>> to get parameter metadata, which indicates which parameters are to be
>> encrypted and how.  So this will require some care by applications
>> that want to do this, but, well, they probably should be careful
>> anyway.  In libpq, the existing APIs make this difficult, because
>> there is no way to pass the result of a describe-statement call back
>> into execute-statement-with-parameters.  I added new functions that do
>> this, so you then essentially do
>>
>>      res0 = PQdescribePrepared(conn, "");
>>      res = PQexecPrepared2(conn, "", 2, values, NULL, NULL, 0, res0);
>>
>> (The name could obviously be improved.)  Other client APIs that have a
>> "statement handle" concept could do this more elegantly and probably
>> without any API changes.
>>
>> Another challenge is that the parse analysis must check which
>> underlying column a parameter corresponds to.  This is similar to
>> resorigtbl and resorigcol in the opposite direction.  The current
>> implementation of this works for the test cases, but I know it has
>> some problems, so I'll continue working in this.  This functionality
>> is in principle available to all prepared-statement variants, not only
>> protocol-level.  So you can see in the tests that I expanded the
>> pg_prepared_statements view to show this information as well, which
>> also provides an easy way to test and debug this functionality
>> independent of column encryption.
>>
>> And also, psql doesn't use prepared statements, so writing into
>> encrypted columns currently doesn't work at all via psql.  (Reading
>> works no problem.)  All the test code currently uses custom libpq C
>> programs.  We should think about a way to enable prepared statements
>> in psql, perhaps something like
>>
>> INSERT INTO t1 VALUES ($1, $2) \gg 'val1' 'val2'
>>
>> (\gexec and \gx are already taken.)
>>
>> ------
>>
>> This is not targeting PostgreSQL 15.  But I'd appreciate some feedback
>> on the direction.  As I mentioned above, a lot of the remaining work
>> is arguably mostly straightforward.  Some closer examination of the
>> issues surrounding the libpq API changes and psql would be useful.
>> Perhaps there are other projects where that kind of functionality
>> would also be useful.

Attachment Content-Type Size
v3-0001-Transparent-column-encryption.patch text/plain 224.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2022-07-05 11:00:06 Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
Previous Message Julien Rouhaud 2022-07-05 10:50:32 Re: Schema variables - new implementation for Postgres 15