Re: better page-level checksums

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: better page-level checksums
Date: 2022-06-14 15:08:43
Message-ID: CAEze2WgVcxNbPBUJNENY7V3-+7qV9Wfr4xLhoTdw1TyE_M09OA@mail.gmail.com
Lists: pgsql-hackers

On Tue, 14 Jun 2022 at 14:56, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Mon, Jun 13, 2022 at 5:14 PM Matthias van de Meent
> <boekewurm+postgres(at)gmail(dot)com> wrote:
> > It's not that I disagree with (or dislike the idea of) increasing the
> > resilience of checksums, I just want to be very careful that we don't
> > trade (potentially significant) runtime performance for features
> > people might not use. This thread seems very related to the 'storing
> > an explicit nonce'-thread, which also wants to reclaim space from a
> > page that is currently used by AMs, while AMs would lose access to
> > certain information on pages and certain optimizations that they could
> > do before. I'm very hesitant to let just any modification to the page
> > format go through because someone needs extra metadata attached to a
> > page.
>
> Right. So, to be clear, I think there is an opportunity to store ONE
> extra blob of data in the page. It might be an extended checksum, or
> it might be a nonce for cryptographic authentication, but it can't be
> both. I think this is OK, because in earlier discussions of TDE, it
> seems that if you're using encryption and also want to verify page
> integrity, you'll use an encryption system that produces some kind of
> verifier, and you'll store that into this space in the page instead of
> using an enhanced-checksum feature.

Agreed.

> In other words, I'm imagining creating a space at the end of each page
> for some sort of enhanced security or data integrity feature, and you
> can either choose not to use one (in which case things work as they do
> today), or you can choose an extended checksums feature, or maybe in
> the future you can choose some form of TDE that involves storing a
> nonce or a page verifier in the page. But you just get one.
>
> Now, the logical question to ask is: well, if there's only one
> opportunity to store an extra blob of data on every page, is this the
> best way to use it? What if someone comes along with another feature
> that also wants to store a blob of data on every page, and they can't
> do it because this proposal got there first? My answer is: well, if
> that additional feature is something that provides encryption or
> tamper-resistance or data integrity or security in any form, then it
> can just be added as a new option for how you use this blob of space,
> and users who prefer the new thing to the existing options can pick
> it. If it's something else, then .... what is it, exactly? It seems to
> me that the kinds of things that require space in *every* page of the
> cluster are really the things that fall into this category.
>
> For example, Stephen mused earlier that maybe while we're at it we
> could find a way to include an XID epoch in every page. Maybe so, but
> we wouldn't actually want that in *every* page. We would only want it
> in the heap pages. And as far as I can see that's pretty generally how
> things go. There are plenty of projects that might want extra space in
> each page *for a certain AM* and I don't see any reason why what I
> propose to do here would rule that out. I think this and that could
> both be done, and doing this might even make doing that easier by
> putting in place some useful infrastructure. What I don't think we can
> get away with is having multiple systems that are each taking a bite
> out of every page for every AM -- but I think that's OK, because I
> don't think there's a lot of need for multiple such systems.

I agree with the premise that we only need one such blob per page, yet
I don't think that putting it at the very end of the page is the best
option.

PageGetSpecialPointer is much simpler when you can rely on the
location of the special area. The special area may be accessed N
times each time a buffer is loaded from disk, while the 'storage
system extra blob' is touched only twice (once on read, once on
write), so I think the special area should have priority when handing
out page space.
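
For reference, bufpage.h defines that lookup as (macro form as of
PG 14, before the inline-function conversion):

    #define PageGetSpecialPointer(page) \
    ( \
        AssertMacro(PageValidateSpecialPointer(page)), \
        (char *) ((char *) (page) + ((PageHeader) (page))->pd_special) \
    )

i.e. one pd_special load from the page header per access, which [0]
gets rid of entirely by using a constant offset from the page end.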

> > That reminds me, there's one more item to be put on the compatibility
> > checklist: Currently, the FSM code assumes it can use all space on a
> > page (except the page header) for its total of 3 levels of FSM data.
> > Mixing page formats would break how it currently works, as changing
> > the space that is available on a page will change the fanout level of
> > each leaf in the tree, which our current code can't handle. To change
> > the page format of one page in the FSM would thus either require a
> > rewrite of the whole FSM fork, or extra metadata attached to the
> > relation that details where the format changes. A similar issue exists
> > with the VM fork.
>
> I agree with all of this except I think that "mixing page formats" is
> a thing we can't do.

I'm not sure it's impossible, but I do agree it would not be a
trivial problem to solve.
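
For reference, the FSM's fanout falls directly out of the usable page
size (src/include/storage/fsm_internals.h):

    #define NodesPerPage (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
                          offsetof(FSMPageData, fp_nodes))

    #define NonLeafNodesPerPage (BLCKSZ / 2 - 1)
    #define LeafNodesPerPage (NodesPerPage - NonLeafNodesPerPage)

    #define SlotsPerFSMPage LeafNodesPerPage

Reserving extra bytes on an FSM page shrinks NodesPerPage and thus
SlotsPerFSMPage, changing the addressing of every slot below that
page in the tree.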

> > That being said, I think that it could be possible to reuse
> > pd_checksum as an extra area indicator between pd_upper and
> > pd_special, so that we'd get [pageheader][pd_linp...] pd_lower [hole]
> > pd_upper [datas] pd_storage_ext [blackbox] pd_special [special area].
> > This should require limited rework in current AMs, especially if we
> > provide a global MAX_STORAGE_EXT_SIZE that AMs can use to get some
> > upper limit on how much overhead the storage uses per page.
>
> This is an interesting alternative. It's unclear to me that it makes
> anything better if the [blackbox] area is before the special area vs.
> afterward.

The main benefit of this order is that an AM will see its special
area at a fixed location as long as it uses a fixed-size Opaque
struct, i.e. an AM may still use (Page + BLCKSZ - sizeof(IndexOpaque))
as seen in [0]. There might be little to gain, but there is also
little to lose for the storage system -- a page is read from or
written to the FS at most once for each time it is accessed or
modified in memory. I'd thus much rather let the IO subsystem pay this
cost than the AM: offloading it to the AM would impose a constant
overhead on all in-memory operations, while offloading it to the IO
path means it is only felt once per swapped block, on average.
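
A sketch of that fixed-location access, with IndexOpaque standing in
for the AM's actual opaque struct (made-up names, illustration only):

    /*
     * Sketch: with a fixed-size opaque struct pinned to the end of the
     * page, the AM needs no pd_special lookup at all.  AMPageGetOpaque
     * and IndexOpaque are stand-ins, not existing API.
     */
    #define AMPageGetOpaque(page) \
        ((IndexOpaque *) ((char *) (page) + BLCKSZ - \
                          MAXALIGN(sizeof(IndexOpaque))))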

The best point of this layout is that it lets us determine what the
data on each page is for without requiring access to shmem variables.
Appending or prepending storage-special areas to the pd_special area
would confuse AMs about which data on the page is theirs -- making the
blob explicit in the page format removes this potential for confusion,
while still allowing the storage-blob area to be dynamically sized.
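
To make that concrete, a rough sketch of the accessors (all names are
made up for illustration; pd_storage_ext would live in what is
currently pd_checksum):

    /*
     * Hypothetical: pd_checksum is reinterpreted as the offset of the
     * storage blob, which sits between the tuple data and the special
     * area.  None of these names are existing API.
     */
    #define PageGetStorageExtPointer(page) \
        ((char *) (page) + ((PageHeader) (page))->pd_checksum)

    /* the blob runs from pd_storage_ext up to pd_special */
    #define PageGetStorageExtSize(page) \
        ((Size) (((PageHeader) (page))->pd_special - \
                 ((PageHeader) (page))->pd_checksum))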

> And either way, if that area is fixed-size across the
> cluster, you don't really need to use pd_checksum to find it, because
> you can just know where it is. A possible advantage of this approach
> is that it might make it simpler to cope with a scenario where some
> pages in the cluster have this blackbox space and others don't. I
> wasn't really thinking that on-line page format conversions were
> likely to be practical, but certainly the chances are better if we've
> got an explicit pointer to the extra space vs. just knowing where it
> has to be.
>
> > Alternatively, we could claim some space on a page using a special
> > line pointer at the start of the page referring to storage data, while
> > having the same limitation on size.
>
> That sounds messy.

Yep. It isn't my first choice either, but it is something that I did
consider -- it has the potentially desirable effect that the AM is
able to relocate this blob.

> > One last option is we recognise that there are two storage locations
> > of pages that have different data requirements -- on-disk that
> > requires checksums, and in-memory that requires LSNs. Currently, those
> > fields are both stored on the page in distinct fields, but we could
> > (_could_) update the code to drop LSN when we store the page, and drop
> > the checksum when we load the page (at the cost of redo speed when
> > recovering from an unclean shutdown). That would provide an extra 64
> > bits on the page without breaking storage, assuming AMs don't already
> > misuse pd_lsn.
>
> It seems wrong to me to say that we don't need the LSN for a page
> stored on disk. Recovery relies on it.

It's not critical for recovery, "just" very useful; but indeed this
too isn't great.

- Matthias

[0] https://commitfest.postgresql.org/38/3543
[1] https://www.postgresql.org/message-id/CA+TgmoaD8wMN6i1mmuo+4ZNeGE3Hd57ys8uV8UZm7cneqy3W2g@mail.gmail.com
