Re: [PATCHES] Post-special page storage TDE support

From: David Christensen <david(dot)christensen(at)crunchydata(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCHES] Post-special page storage TDE support
Date: 2023-05-30 17:35:37
Message-ID: CAOxo6XKKPJkA2=xMxFuqgdhWjcBvHNXVnpNELr008-VPhKqmSw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, May 12, 2023 at 7:48 PM Stephen Frost <sfrost(at)snowman(dot)net> wrote:
>
> Greetings,
>
> * David Christensen (david(dot)christensen(at)crunchydata(dot)com) wrote:
> > Refreshing this with HEAD as of today, v4.
>
> Thanks for updating this!

Thanks for the patience in my response here.

> > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> >
> > This space is reserved for extended data on the Page structure which will be ultimately used for
> > encrypted data, extended checksums, and potentially other things. This data appears at the end of
> > the Page, after any `pd_special` area, and will be calculated at runtime based on specific
> > ControlFile features.
> >
> > No effort is made to ensure this is backwards-compatible with existing clusters for `pg_upgrade`, as
> > we will require logical replication to move data into a cluster with different settings here.
>
> This initial patch, at least, does maintain pg_upgrade as the
> reserved_page_size (maybe not a great name?) is set to 0, right?
> Basically this is just introducing the concept of a reserved_page_size
> and adjusting all of the code that currently uses BLCKSZ or
> PageGetPageSize() to account for this extra space.

Correct; a reserved_page_size of 0 would be the same page format as
currently exists, so you could use pg_upgrade with no page features
and be binary compatible with existing clusters.

> Looking at the changes to bufpage.h, in particular ...
>
> > diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
>
> > @@ -19,6 +19,14 @@
> > #include "storage/item.h"
> > #include "storage/off.h"
> >
> > +extern PGDLLIMPORT int reserved_page_size;
> > +
> > +#define SizeOfPageReservedSpace() reserved_page_size
> > +#define MaxSizeOfPageReservedSpace 0
> > +
> > +/* strict upper bound on the amount of space occupied we have reserved on
> > + * pages in this cluster */
>
> This will eventually be calculated based on what features are supported
> concurrently?

Correct; these are fleshed out in later patches.

> > @@ -36,10 +44,10 @@
> > * | v pd_upper |
> > * +-------------+------------------------------------+
> > * | | tupleN ... |
> > - * +-------------+------------------+-----------------+
> > - * | ... tuple3 tuple2 tuple1 | "special space" |
> > - * +--------------------------------+-----------------+
> > - * ^ pd_special
> > + * +-------------+-----+------------+----+------------+
> > + * | ... tuple2 tuple1 | "special space" | "reserved" |
> > + * +-------------------+------------+----+------------+
> > + * ^ pd_special ^ reserved_page_space
>
> Right, adds a dynamic amount of space 'post-special area'.

Dynamic as in "fixed at initdb time" instead of compile time. However,
things are coded in such a way that the page feature bitmap is stored
on a given page, so different pages could have different
reserved_page_size depending on use case/code path. (Basically
preserving future flexibility while minimizing code changes here.) We
could utilize different features depending on what type of page it is,
say, or have different relations or tablespaces with different page
feature defaults.

> > @@ -73,6 +81,8 @@
> > * stored as the page trailer. an access method should always
> > * initialize its pages with PageInit and then set its own opaque
> > * fields.
> > + *
> > + * XXX - update more comments here about reserved_page_space
> > */
>
> Would be good to do. ;)

Next revision... :D

> > @@ -325,7 +335,7 @@ static inline void
> > PageValidateSpecialPointer(Page page)
> > {
> > Assert(page);
> > - Assert(((PageHeader) page)->pd_special <= BLCKSZ);
> > + Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ);
> > Assert(((PageHeader) page)->pd_special >= SizeOfPageHeaderData);
> > }
>
> This is just one usage ... but seems like maybe we should be using
> PageGetPageSize() here instead of BLCKSZ, and that more-or-less
> throughout? Nearly everywhere we're using BLCKSZ today to give us that
> compile-time advantage of a fixed block size is going to lose that
> advantage anyway thanks to reserved_page_size being run-time. Now, one
> up-side to this is that it'd also get us closer to being able to support
> dynamic block sizes concurrently which would be quite interesting. That
> is, a special tablespace with a 32KB block size while the rest are the
> traditional 8KB. This would likely require multiple shared buffer
> pools, of course...

I think multiple shared-buffer pools is a ways off; but sure, this
would support this sort of use case as well. I am working on a new
patch for this series (probably the first one in the series) which
will actually just abstract away all existing compile-time usages of
BLCKSZ. This will be a start in that direction and also make the
reserved_page_size patch a bit more reasonable to review.

> > diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
> > index 9a302ddc30..a93cd9df9f 100644
> > --- a/src/backend/storage/page/bufpage.c
> > +++ b/src/backend/storage/page/bufpage.c
> > @@ -26,6 +26,8 @@
> > /* GUC variable */
> > bool ignore_checksum_failure = false;
> >
> > +int reserved_page_size = 0; /* how much page space to reserve for extended unencrypted metadata */
> > +
> >
> > /* ----------------------------------------------------------------
> > * Page support functions
> > @@ -43,7 +45,7 @@ PageInit(Page page, Size pageSize, Size specialSize)
> > {
> > PageHeader p = (PageHeader) page;
> >
> > - specialSize = MAXALIGN(specialSize);
> > + specialSize = MAXALIGN(specialSize) + reserved_page_size;
>
> Rather than make it part of specialSize, I would think we'd be better
> off just treating them independently. Eg, the later pd_upper setting
> would be done by:
>
> p->pd_upper = pageSize - specialSize - reserved_page_size;
>
> etc.

I can see that there's a mild readability benefit, but really the
effect is local to PageInit(), so ¯\_(ツ)_/¯... happy to make that
change though.

> > @@ -186,7 +188,7 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags)
> > * one that is both unused and deallocated.
> > *
> > * If flag PAI_IS_HEAP is set, we enforce that there can't be more than
> > - * MaxHeapTuplesPerPage line pointers on the page.
> > + * MaxHeapTuplesPerPage() line pointers on the page.
>
> Making MaxHeapTuplesPerPage() runtime dynamic is a requirement for
> supporting multiple page sizes concurrently ... but I'm not sure it's
> actually required for the reserved_page_size idea as currently
> considered. The reason is that with 8K or larger pages, the amount of
> space we're already throwing away is at least 20 bytes, if I did my math
> right. If we constrain reserved_page_size to be 20 bytes or less, as I
> believe we're currently thinking we won't need that much, then we could
> perhaps keep MaxHeapTuplesPerPage as a compile-time constant.

In this version we don't have that explicit constraint. In practice I
don't know that we have many more than 20 bytes, at least for the
first few features, but I don't think we can count on that forever
going forward. At some point we're going to have to parameterize
these, so might as well do it in this pass, since how else would you
know that this magic value has been exceeded?

> On the other hand, to the extent that we want to consider having
> variable page sizes in the future, perhaps we do want to change this.
> If so, the approach broadly looks reasonable to me, but I'd suggest we
> make that a separate patch from the introduction of reserved_page_size.

The variable blocksize patch I'm working on includes some of this, so
this will be in the next revision.

> > @@ -211,7 +213,7 @@ PageAddItemExtended(Page page,
> > if (phdr->pd_lower < SizeOfPageHeaderData ||
> > phdr->pd_lower > phdr->pd_upper ||
> > phdr->pd_upper > phdr->pd_special ||
> > - phdr->pd_special > BLCKSZ)
> > + phdr->pd_special + reserved_page_size > BLCKSZ)
> > ereport(PANIC,
> > (errcode(ERRCODE_DATA_CORRUPTED),
> > errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
>
> Probably should add reserved_page_size to that errmsg output? Also,
> this check of pointers seems to be done multiple times- maybe it should
> be moved into a #define or similar?

Sure, can change; agreed it'd be good to have. I just modified the
existing call sites and didn't attempt to change too much else.

[snipped other instances...]

> Making MaxTIDsPerBTreePage dynamic looks to be required as it doesn't
> end up with any 'leftover' space, from what I can tell. Again, though,
> perhaps this should be split out as an independent patch from the rest.
> That is- we can change the higher-level functions to be dynamic in the
> initial patches, and then eventually we'll get down to making the
> lower-level functions dynamic.

Same; should be accounted for in the next variable blocksize patch.
It does have a cascading effect though, so hard to make the high-level
functions dynamic but not the lower-level ones. What is the benefit in
this case for separating those two?

> > diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
> > index efdf9415d1..8ebabdd7ee 100644
> > --- a/contrib/bloom/bloom.h
> > +++ b/contrib/bloom/bloom.h
> > @@ -131,7 +131,7 @@ typedef struct BloomMetaPageData
> > #define BLOOM_MAGICK_NUMBER (0xDBAC0DED)
> >
> > /* Number of blocks numbers fit in BloomMetaPageData */
> > -#define BloomMetaBlockN (sizeof(FreeBlockNumberArray) / sizeof(BlockNumber))
> > +#define BloomMetaBlockN() ((sizeof(FreeBlockNumberArray) - SizeOfPageReservedSpace())/ sizeof(BlockNumber))
> >
> > #define BloomPageGetMeta(page) ((BloomMetaPageData *) PageGetContents(page))
> >
> > @@ -151,6 +151,7 @@ typedef struct BloomState
> >
> > #define BloomPageGetFreeSpace(state, page) \
> > (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
> > + - SizeOfPageReservedSpace() \
> > - BloomPageGetMaxOffset(page) * (state)->sizeOfBloomTuple \
> > - MAXALIGN(sizeof(BloomPageOpaqueData)))
>
> This formulation (or something close to it) tends to happen quite a bit:
>
> (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - SizeOfPageReservedSpace() ...
>
> This is basically asking for "amount of usable space" where the
> resulting 'usable space' either includes line pointers and tuples or
> similar, or doesn't. Perhaps we should break this down into two
> patches- one which provides a function to return usable space on a page,
> and then the patch to add reserved_page_size can simply adjust that
> instead of changing the very, very many places we have this forumlation.

Yeah, I can make this a computed expression; agreed it's pretty common
to have the usable space on the page so really any AM shouldn't know
or care about the details of either header or footer. Since we
already have PageGetContents() I will probably name it
PageGetContentsSize(). The AM can own everything from the pointer
returned by PageGetContents() through said size, allowing for both the
header and reserved_page_size in said computation.

> > diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
> > index d935ed8fbd..d3d74a9d28 100644
> > --- a/contrib/bloom/blutils.c
> > +++ b/contrib/bloom/blutils.c
> > @@ -430,10 +430,10 @@ BloomFillMetapage(Relation index, Page metaPage)
> > */
> > BloomInitPage(metaPage, BLOOM_META);
> > metadata = BloomPageGetMeta(metaPage);
> > - memset(metadata, 0, sizeof(BloomMetaPageData));
> > + memset(metadata, 0, sizeof(BloomMetaPageData) - SizeOfPageReservedSpace());
>
> This doesn't seem quite right? The reserved space is off at the end of
> the page and this is 0'ing the space immediately after the page header,
> if I'm following correctly, and only to the size of BloomMetaPageData...

I think you're correct with that analysis. BloomInitPage() would have
(probably?) had a zero'd page so this underset would have been
unnoticed in practice, but still good to fix.

> > metadata->magickNumber = BLOOM_MAGICK_NUMBER;
> > metadata->opts = *opts;
> > - ((PageHeader) metaPage)->pd_lower += sizeof(BloomMetaPageData);
> > + ((PageHeader) metaPage)->pd_lower += sizeof(BloomMetaPageData) - SizeOfPageReservedSpace();
>
> Not quite following what's going on here either.

Heh, not sure either. Not sure if there was a reason or a mechanical
replacement, but will look when I do next revisions.

> > diff --git a/contrib/bloom/blvacuum.c b/contrib/bloom/blvacuum.c
> > @@ -116,7 +116,7 @@ blbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
> > */
> > if (BloomPageGetMaxOffset(page) != 0 &&
> > BloomPageGetFreeSpace(&state, page) >= state.sizeOfBloomTuple &&
> > - countPage < BloomMetaBlockN)
> > + countPage < BloomMetaBlockN())
> > notFullPage[countPage++] = blkno;
>
> Looks to be another opportunity to have a separate patch making this
> change first before actually changing the lower-level #define's,

#include <stdpatch/variable-blocksize>

> > diff --git a/src/backend/access/brin/brin_tuple.c b/src/backend/access/brin/brin_tuple.c
> > @@ -217,7 +217,7 @@ brin_form_tuple(BrinDesc *brdesc, BlockNumber blkno, BrinMemTuple *tuple,
> > * datatype, try to compress it in-line.
> > */
> > if (!VARATT_IS_EXTENDED(DatumGetPointer(value)) &&
> > - VARSIZE(DatumGetPointer(value)) > TOAST_INDEX_TARGET &&
> > + VARSIZE(DatumGetPointer(value)) > TOAST_INDEX_TARGET() &&
> > (atttype->typstorage == TYPSTORAGE_EXTENDED ||
> > atttype->typstorage == TYPSTORAGE_MAIN))
> > {
>
> Probably could be another patch but also if we're going to change
> TOAST_INDEX_TARGET to be a function we should probably not have it named
> in all-CAPS.

Okay, can make those style changes as well; agreed ALLCAPS should be constant.

> > diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
> > index 43b67893d9..5babbb457a 100644
> > --- a/src/backend/access/nbtree/nbtsplitloc.c
> > +++ b/src/backend/access/nbtree/nbtsplitloc.c
> > @@ -156,7 +156,7 @@ _bt_findsplitloc(Relation rel,
> >
> > /* Total free space available on a btree page, after fixed overhead */
> > leftspace = rightspace =
> > - PageGetPageSize(origpage) - SizeOfPageHeaderData -
> > + PageGetPageSize(origpage) - SizeOfPageHeaderData - SizeOfPageReservedSpace() -
> > MAXALIGN(sizeof(BTPageOpaqueData));
>
> Also here ... though a bit interesting that this uses PageGetPageSize()
> instead of BLCKSZ.

Yeah, a few little exceptions. Variable blocksize patch introduces
those every place it can, and ClusterBlockSize() anywhere it can't.

> > diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
> > index 011ec18015..022b5eee4e 100644
> > --- a/src/backend/utils/init/globals.c
> > +++ b/src/backend/utils/init/globals.c
> > @@ -154,3 +154,4 @@ int64 VacuumPageDirty = 0;
> >
> > int VacuumCostBalance = 0; /* working state for vacuum */
> > bool VacuumCostActive = false;
> > +
>
> Unnecessary whitespace hunk ?

Will clean up.

Thanks for the review,

David

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tristan Partin 2023-05-30 18:09:47 Re: Documentation for building with meson
Previous Message James Coleman 2023-05-30 17:26:55 Re: An inefficient query caused by unnecessary PlaceHolderVar