Re: Moving forward with TDE [PATCH v3]

From: Andres Freund <andres(at)anarazel(dot)de>
To: David Christensen <david(dot)christensen(at)crunchydata(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Aleksander Alekseev <aleksander(at)timescale(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Stephen Frost <sfrost(at)snowman(dot)net>
Subject: Re: Moving forward with TDE [PATCH v3]
Date: 2023-11-03 02:32:28
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 2023-10-31 16:23:17 -0500, David Christensen wrote:
> The patches are as follows:
> 0001 - doc updates
> 0002 - Basic key management and cipher support
> 0003 - Backend-related changes to support heap encryption
> 0004 - modifications to bin tools and programs to manage key rotation and
> add other knowledge
> 0005 - Encrypted/authenticated WAL
> These are very broad strokes at this point and should be split up a bit
> more to make things more granular and easier to review, but I wanted to get
> this update out.

Yes, particularly 0003 really needs to be split - as is it's not easily

> From 327e86d52be1df8de9c3a324cb06b85ba5db9604 Mon Sep 17 00:00:00 2001
> From: David Christensen <david(at)pgguru(dot)net>
> Date: Fri, 29 Sep 2023 15:16:00 -0400
> Subject: [PATCH v3 5/5] Add encrypted/authenticated WAL
> When using an encrypted cluster, we need to ensure that the WAL is also
> encrypted. While we could go with an page-based approach, we use instead a
> per-record approach, using GCM for the encryption method and storing the AuthTag
> in the xl_crc field.
> We change the xl_crc field to instead be a union struct, with a compile-time
> adjustable size to allow us to customize the number of bytes allocated to the
> GCM authtag. This allows us to easily adjust the size of bytes needed to
> support our authentication. (Testing has included up to 12, but leaving at this
> point to 4 due to keeping the size of the WAL records the same.)

Ugh, that'll be quite a bit of overhead in some workloads... You can't really
use such a build for non-encrypted workloads, making this a not very
deployable path...

> @@ -905,20 +905,28 @@ XLogInsertRecord(XLogRecData *rdata,
> {
> /*
> * Now that xl_prev has been filled in, calculate CRC of the record
> - * header.
> + * header. If we are using encrypted WAL, this CRC is overwritten by
> + * the authentication tag, so just zero
> */
> - rdata_crc = rechdr->xl_crc;
> - COMP_CRC32C(rdata_crc, rechdr, offsetof(XLogRecord, xl_crc));
> - FIN_CRC32C(rdata_crc);
> - rechdr->xl_crc = rdata_crc;
> + if (!encrypt_wal)
> + {
> + rdata_crc = rechdr->xl_integrity.crc;
> + COMP_CRC32C(rdata_crc, rechdr, offsetof(XLogRecord, xl_integrity.crc));
> + FIN_CRC32C(rdata_crc);
> + rechdr->xl_integrity.crc = rdata_crc;
> + }
> + else
> + memset(&rechdr->xl_integrity, 0, sizeof(rechdr->xl_integrity));

Why aren't you encrypting most of the data here? Just as for CRC computation,
encrypting a large record in XLOG_BLCKSZ

> * XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
> * used to distinguish between block references, and the main data structs.
> */
> +
> +#define XL_AUTHTAG_SIZE 4
> +#define XL_HEADER_PAD 2
> +
> typedef struct XLogRecord
> {
> uint32 xl_tot_len; /* total len of entire record */
> @@ -45,14 +49,16 @@ typedef struct XLogRecord
> XLogRecPtr xl_prev; /* ptr to previous record in log */
> uint8 xl_info; /* flag bits, see below */
> RmgrId xl_rmid; /* resource manager for this record */
> - /* 2 bytes of padding here, initialize to zero */
> - pg_crc32c xl_crc; /* CRC for this record */
> -
> + uint8 xl_pad[XL_HEADER_PAD]; /* required alignment padding */

What does "required" mean here? And why is this defined in a separate define?
And why do we need the explicit field here at all? The union will still
provide sufficient alignment for a uint32.

It also doesn't seem right to remove the comment about needing to zero out the

> From 7557acf60f52da4a86fd9f902bab4804c028dd4b Mon Sep 17 00:00:00 2001
> From: David Christensen <david(dot)christensen(at)crunchydata(dot)com>
> Date: Tue, 31 Oct 2023 15:24:02 -0400
> Subject: [PATCH v3 3/5] Backend-related changes

> --- a/src/backend/access/heap/rewriteheap.c
> +++ b/src/backend/access/heap/rewriteheap.c
> @@ -323,6 +323,11 @@ end_heap_rewrite(RewriteState state)
> state->rs_buffer,
> true);
> + PageEncryptInplace(state->rs_buffer, MAIN_FORKNUM,
> + RelationIsPermanent(state->rs_new_rel),
> + state->rs_blockno,
> + RelationGetSmgr(state->rs_new_rel)->smgr_rlocator.locator.relNumber
> + );
> PageSetChecksumInplace(state->rs_buffer, state->rs_blockno);
> smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM,

I don't think it's ok to have to make such changes in a bunch of places. I
think we need to add an abstraction layer between smgr and code like this
first. Luckily I think Heikki was working abstracting away some of these
direct smgr* uses...

> +Architecture
> +------------
> +
> +Fundamentally, cluster file encryption must store data in a file system
> +in such a way that the keys required to decrypt the file system data can
> +only be accessed using somewhere outside of the file system itself. The
> +external requirement can be someone typing in a passphrase, getting a
> +key from a key management server (KMS), or decrypting a key stored in
> +the file system using a hardware security module (HSM). The current
> +architecture supports all of these methods, and includes sample scripts
> +for them.
> +
> +The simplest method for accessing data keys using some external
> +requirement would be to retrieve all data encryption keys from a KMS.
> +However, retrieved keys would still need to be verified as valid. This
> +method also introduces unacceptable complexity for simpler use-cases,
> +like user-supplied passphrases or HSM usage. External key rotation
> +would also be very hard since it would require re-encrypting all the
> +file system data with the new externally-stored keys.
> +
> +For these reason, a two-tiered architecture is used, which uses two
> +types of encryption keys: a key encryption key (KEK) and data encryption
> +keys (DEK). The KEK should not be present unencrypted in the file system
> +--- it should be supplied the user, stored externally (e.g., in a KMS)

*by* the user?

> +or stored in the file system encrypted with a HSM (e.g., PIV device).
> +The DEK is used to encrypt database files and is stored in the same file
> +system as the database but is encrypted using the KEK. Because the DEK
> +is encrypted, its storage in the file system is no more of a security
> +weakness and the storage of the encrypted database files in the same
> +file system.

As is this paragraph doesn't really follow from the prior paragraph for
me. That a KMS would be hard to use isn't obviously related to splitting the
KEK and the DEK.

> +Implementation
> +--------------
> +
> +To enable cluster file encryption, the initdb option
> +--cluster-key-command must be used, which specifies a command to
> +retrieve the KEK.

FWIW, I think "cluster file encryption" is somewhat ambiguous. That could also
mean encrypting on the file system level or such.

> initdb records the cluster_key_command in
> +postgresql.conf. Every time the KEK is needed, the command is run and
> +must return 64 hex characters which are decoded into the KEK. The
> +command is called twice during initdb, and every time the server starts.
> +initdb also sets the encryption method in controldata during server
> +bootstrap.
> +
> +initdb runs "postgres --boot", which calls function
> +kmgr.c::BootStrapKmgr(), which calls the cluster key command. The
> +cluster key command returns a KEK which is used to encrypt random bytes
> +for each DEK and writes them to the file system by
> +kmgr.c::KmgrWriteCryptoKeys() (unless --copy-encryption-keys is used).
> +Currently the DEK files are 0 and 1 and are stored in
> +$PGDATA/pg_cryptokeys/live. The wrapped DEK files use Key Wrapping with
> +Padding which verifies the validity of the KEK.
> +
> +initdb also does a non-boot backend start which calls
> +kmgr.c::InitializeKmgr(), which calls the cluster key command a second
> +time. This decrypts/unwraps the DEK keys and stores them in the shared
> +memory structure KmgrShmem. This step also happens every time the server
> +starts. Later patches will use the keys stored in KmgrShmem to
> +encrypt/decrypt database files. KmgrShmem is erased via
> +explicit_bzero() on server shutdown.

I think this encodes too many details of how initdb works today. It seems
likely that nobody adding or removing a restart will think of updating this
file - nor should they have to. I'd just say that initdb starts/stops
postgres multiple times.

> +Initialization Vector
> +---------------------
> +
> +Nonce means "number used once". An Initialization Vector (IV) is a
> +specific type of nonce. That is, unique but not necessarily random or
> +secret, as specified by the NIST
> +(
> +To generate unique IVs, the NIST recommends two methods:
> +
> + The first method is to apply the forward cipher function, under
> + the same key that is used for the encryption of the plaintext,
> + to a nonce. The nonce must be a data block that is unique to
> + each execution of the encryption operation. For example, the
> + nonce may be a counter, as described in Appendix B, or a message
> + number. The second method is to generate a random data block
> + using a FIPS-approved random number generator.
> +
> +We will use the first method to generate IVs. That is, select nonce
> +carefully and use a cipher with the key to make it unique enough to use
> +as an IV. The nonce selection for buffer encryption and WAL encryption
> +are described below.
> +
> +If the IV was used more than once with the same key (and we only use one
> +data encryption key), changes in the unencrypted data would be visible
> +in the encrypted data.
> +
> +IV for Heap/Index Encryption
> +- - - - - - - - - - - - - -
> +
> +To create the 16-byte IV needed by AES for each page version, we will
> +use the page LSN (8 bytes) and page number (4 bytes).

I still am quite quite unconvinced that using the LSN as a nonce is a good
design decision.

- LSNs can end up being reused after crash restarts
- the LSN does not contain the timeline ID, if a standby is promoted, two
systems can be using the same LSNs
- The LSN does *NOT* actually change every time a page is modified. Even with
wal_log_hint_bits, only the first hint bit modification to a page in a
checkpoint cycles will cause WAL writes - and changing that would have
a quite substantial overhead.

> The LSN is ideal for use in the IV because it is +always increasing, and is
> changed every time a page is updated. The +same LSN is never used for two
> relations with different page contents.

As mentioned above, this is not true - the LSN does *NOT* change every time
the page is updated.

> +However, the same LSN can be used in multiple pages in the same relation
> +--- this can happen when a heap update expires an old tuple and adds a
> +new tuple to another page. By adding the page number to the IV, we keep
> +the IV unique.

There's many other ways that can happen.

> +CREATE DATABASE can be run with two different strategies: FILE_COPY or
> +WAL_LOG. If using WAL_LOG, the heap/index files are automatically
> +rewritten with new LSNs as part of the copy operation and will get new
> +IVs automatically.
> +
> +This approach still works with the older FILE_COPY stragegy; by not
> +using the database id in the IV, CREATE DATABASE can copy the heap/index
> +files from the old database to a new one without decryption/encryption.
> +Both page copies are valid. Once a database changes its pages, it gets
> +new LSNs, and hence new IV.
> +
> +As part of WAL logging, every change of a WAL-logged page gets a new
> +LSN, and therefore a new IV automatically.
> +
> +However, the LSN must then be visible on encrypted pages, so we will not
> +encrypt the LSN on the page. We will also not encrypt the CRC so
> +pg_checksums can still check pages offline without access to the keys.

s/crc/checksum/? The data-page-level checksum isn't a CRC.

> +Non-Permanent Relations
> +- - - - - - - - - - - -
> +
> +To avoid the overhead of generating WAL for non-permanent (unlogged and
> +temporary) relations, we assign fake LSNs that are derived from a
> +counter via xlog.c::GetFakeLSNForUnloggedRel(). (GiST also uses this
> +counter for LSNs.) We also set a bit in the IV so the use of the same
> +value for WAL (real) and fake LSNs will still generate unique IVs. Only
> +main forks are encrypted, not init, vm, or fsm files.

Why aren't other forks encrypted? This seems like a very odd design to me.

> +Hint Bits
> +- - - - -
> +
> +For hint bit changes, the LSN normally doesn't change, which is a
> +problem. By enabling wal_log_hints, you get full page writes to the WAL
> +after the first hint bit change of the checkpoint. This is useful for
> +two reasons. First, it generates a new LSN, which is needed for the IV
> +to be secure. Second, full page images protect against torn pages,
> +which is an even bigger requirement for encryption because the new LSN
> +is re-encrypting the entire page, not just the hint bit changes. You
> +can safely lose the hint bit changes, but you need to use the same LSN
> +to decrypt the entire page, so a torn page with an LSN change cannot be
> +decrypted. To prevent this, wal_log_hints guarantees that the
> +pre-hint-bit version (and previous LSN version) of the page is restored.
> +
> +However, if a hint-bit-modified page is written to the file system
> +during a checkpoint, and there is a later hint bit change switching the
> +same page from clean to dirty during the same checkpoint, we need a new
> +LSN, and wal_log_hints doesn't give us a new LSN here. The fix for this
> +is to update the page LSN by writing a dummy WAL record via
> +xloginsert.c::LSNForEncryption() in such cases.

Ugh, so that's really the plan. That's a substantial overhead in some common



Andres Freund

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2023-11-03 02:33:26 Re: A recent message added to pg_upgade
Previous Message Japin Li 2023-11-03 02:14:35 Re: Tab completion regression test failed on illumos