Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas
Date: 2022-04-07 21:53:26
Message-ID: CAEze2WisHyem0fKtUEUdPW_6C5ckAmGhwCBOLpKLdYC2xk4H=g@mail.gmail.com
Lists: pgsql-hackers

On Thu, 7 Apr 2022 at 21:11, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Thu, Apr 7, 2022 at 2:43 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > But if we were in a green-field situation we'd probably not want to
> > use up several bytes for a nonce anyway. You said so yourself.
>
> I don't know what statement of mine you're talking about here, and
> while I don't love using up space for a nonce, it seems to be the way
> this encryption stuff works. I don't see that there's a reasonable
> alternative, green field or no.
>
> > > I do understand that there are significant challenges and performance
> > > concerns around having these kinds of initdb-controlled page layout
> > > changes, so the future of that patch is unclear.
> >
> > Why does it need to be at initdb time?
> >
> > Though I cannot prove it, I suspect that the original intent of the
> > special area was to support an additional (though typically small)
> > variable length array, that works a little like the current line
> > pointer array. This array would have to grow backwards (newer items
> > get appended at earlier physical offsets), unlike our line pointer
> > array (which gets appended to at the end, in the simple and obvious
> > way). Growing backwards like this happens in DB systems that store
> > their line pointer array at the end of the page (the traditional
> > approach from the System R days, I believe).
> >
> > Supporting a variable-length special area array like this would mean
> > that any time you add a new item to the variable-sized array in the
> > special area, the page's entire tuple space has to be memmove()'d
> > backwards by a couple of bytes to create the required space. And so
> > the relevant bufpage.c routine would have to adjust the whole line
> > pointer array such that each lp_off received a compensating
> > adjustment. The array might only be for some kind of page-level
> > transaction metadata, something like that -- shifting it around is
> > pretty expensive (reusing existing slots isn't too expensive, though).
> >
> > Why can't it work like that? You don't really need to build the full
> > set of bufpage.c facilities (though it might not be a bad idea to
> > fully support these variable-length arrays, which seem like they might
> > come in handy). That seems perfectly compatible with what Matthias
> > wants to do, provided we're willing to deem the special area struct
> > (e.g. BTOpaque) as always coming "first" (which is essentially the
> > same as his current proposal anyway). You can even do the same thing
> > yourself for the nonce (use a fixed, known offset), with relatively
> > modest effort. You'd need to have AM-specific knowledge (it would
> > stack right on top of Matthias's technique), but that doesn't seem all
> > that hard. There are plenty of remaining status bits in BTOpaque, and
> > probably all other index AM special areas.
>
> I'm not really following any of this. You seem to be arguing about
> whether it's possible to change the length of the special space
> *later* than initdb time. I agree that might have some use for some
> purpose, but for encryption it's not necessarily all that helpful
> because you have to be able to find the nonce on the page before
> you've decrypted it. If you don't know whether there's a nonce or
> where it's located, you can't do that. What Matthias and I were
> discussing is whether you have to make a decision about appending
> stuff to the special space *earlier* than initdb-time i.e. at compile
> time.
>
> My position is that if we need some space in every page to put a
> nonce, the best place to put it is at the very end of the page, within
> the special space and after anything else that is stored in the
> special space. Code that only manipulates the line pointer array and
> tuple data won't care, because pd_special will just be a bit smaller
> than it would otherwise have been, and none of that code looks at any
> byte offset >= pd_special. Code that looks at the special space won't
> care either, as long as it uses PageGetSpecialPointer to find the
> data, and doesn't examine how large the special space actually is.
> That corresponds pretty well to how existing users of the special
> space work, so it seems pretty good.

Except that reserving space on each page requires recalculating every
value that depends on the amount of free space potentially available
on a page (for some of these an imprecise value is tolerable, for
others it is critical that the value is correct). If that is always
done at runtime, it can cause significant overhead.
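
A minimal sketch of that concern, in C. Every name and size here is
hypothetical (the constants only resemble an 8kB heap page and are not
PostgreSQL's actual definitions): with a fixed layout, a limit such as
the maximum number of tuples per page folds into a compile-time
constant, while with an initdb-time reservation it has to be computed
from a variable each time it is needed.

```c
#include <assert.h>

/* Illustrative sizes only; assumptions, not PostgreSQL's real macros. */
#define SKETCH_BLCKSZ        8192u  /* page size */
#define SKETCH_PAGE_HEADER   24u    /* fixed page header */
#define SKETCH_LINE_POINTER  4u     /* one line pointer */
#define SKETCH_MIN_TUPLE     24u    /* assumed smallest aligned tuple */

/* Fixed layout: the compiler folds this expression into a constant. */
#define SKETCH_MAX_TUPLES_FIXED \
    ((SKETCH_BLCKSZ - SKETCH_PAGE_HEADER) / \
     (SKETCH_MIN_TUPLE + SKETCH_LINE_POINTER))

/* Hypothetical per-page reservation, chosen at initdb time. */
static unsigned sketch_reserved_bytes = 0;

/*
 * Variable layout: the same limit now depends on a run-time value, so
 * every caller pays for a load, a subtraction and a division.
 */
static unsigned
sketch_max_tuples_runtime(void)
{
    /* The reservation must leave room for at least the header. */
    assert(sketch_reserved_bytes < SKETCH_BLCKSZ - SKETCH_PAGE_HEADER);

    return (SKETCH_BLCKSZ - SKETCH_PAGE_HEADER - sketch_reserved_bytes) /
        (SKETCH_MIN_TUPLE + SKETCH_LINE_POINTER);
}
```

With no reserved bytes both variants agree; reserving 32 bytes lowers
the run-time limit by one tuple. A compile-time constant can also be
used to size fixed arrays, which the run-time variant cannot do without
a worst-case bound or dynamic allocation.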

> If we *didn't* put the nonce at the end of the page, where else would
> we put it? It has to be at a fixed offset, because otherwise you can't
> find it without decrypting the page first, which would be circular.

I don't see a particularly good reason why the nonce would have to be
stored on disk in the same place where we reserve space for it in the
unencrypted in-memory format.

> You could put it at the beginning of the page, or after the page
> header and before the line pointer array, but either of those things
> seem likely to affect a lot more code, because there's a lot more
> stuff that accesses the line pointer array than the special space.

I'm not too keen on anything that leaves us without page layout
guarantees. I can understand needing a nonce; but couldn't that be put
somewhere other than smack-dab in the middle of what are considered
AM-controlled areas?

I'm not certain why we need the LSN on a page after we've checked that
we flushed all WAL up to that LSN. That is, right now we store the LSN
in the in-memory representation of the page because we need it to
check that the WAL is flushed up to that point when we write out the
page, so that we can recover the data in case of disk write issues.
But after flushing the WAL to disk, this LSN on the page is not needed
anymore, and could thus be replaced with a nonce. When reading such a
page, the LSN-now-nonce can be replaced with the latest flushed LSN to
prevent unwanted xlog flushes. Sure, this limits the nonce to 8 bytes,
but if you really need more than that IMO you can recompile from
scratch with a bigger PageHeader.
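
A rough sketch of that swap, in C. The struct, the flushed-LSN variable
and both functions are hypothetical stand-ins, not existing PostgreSQL
code, and the encryption step itself is left out; the only point is
that the same 8 bytes hold the LSN in memory and the nonce on disk.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical header: only the 8-byte slot that changes meaning. */
typedef struct SketchPageHeader
{
    uint64_t lsn_or_nonce;  /* LSN while in memory, nonce once on disk */
    /* remaining header fields omitted */
} SketchPageHeader;

/* Hypothetical stand-in for "WAL is flushed up to here". */
static uint64_t sketch_flushed_lsn = 0;

/*
 * Before write-out: the page LSN is only needed to verify that WAL has
 * been flushed far enough. Once that holds, the slot is overwritten
 * with the nonce. Returns -1 if the caller must flush WAL first.
 */
static int
sketch_prepare_for_write(SketchPageHeader *hdr, uint64_t nonce)
{
    if (hdr->lsn_or_nonce > sketch_flushed_lsn)
        return -1;
    hdr->lsn_or_nonce = nonce;
    return 0;
}

/*
 * After read-in: hand back the stored nonce (needed for decryption)
 * and put the latest flushed LSN into the slot, so that a later
 * write-out of this page does not force an extra WAL flush.
 */
static uint64_t
sketch_after_read(SketchPageHeader *hdr)
{
    uint64_t nonce = hdr->lsn_or_nonce;

    hdr->lsn_or_nonce = sketch_flushed_lsn;
    return nonce;
}
```

Note that in this sketch the page's original LSN is gone after the
swap; what the page carries after a read is only the flushed-LSN
substitute.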

The benefit of using the existing page header structs is that we could
enable TDE at the relation level and on existing clusters: no extra
space is needed, because the space is already there.
Yes, we'd lose the ability to skip redo on pages, but I think that's a
small price to pay when you enable TDE.

-Matthias
