Re: Online checksums patch - once again

From: Daniel Gustafsson <daniel(at)yesql(dot)se>
To: Michael Banck <michael(dot)banck(at)credativ(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online checksums patch - once again
Date: 2021-01-07 14:03:07
Message-ID: 53B0749B-5A83-4E4D-81EB-9AAE685736FE@yesql.se
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On 5 Jan 2021, at 21:29, Michael Banck <michael(dot)banck(at)credativ(dot)de> wrote:
> On Tue, Jan 05, 2021 at 12:18:07AM +0100, Daniel Gustafsson wrote:

>> Attached is a rebase of the patch on top of current HEAD.
>
> Some more comments/questions:

Thanks for reviewing, much appreciated!

>> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
>> index 53e997cd55..06c001f8ff 100644
>> --- a/src/backend/access/heap/heapam.c
>> +++ b/src/backend/access/heap/heapam.c
>> @@ -7284,7 +7284,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
>> * and dirtied.
>> *
>> * If checksums are enabled, we also generate a full-page image of
>> - * heap_buffer, if necessary.
>> + * heap_buffer.
>
> That sounds like it has nothing to do with online (de)activation of
> checksums?

Right, it's a fix which is independent of this which could be broken out into a
separate docs patch along with the suggestions in your previous mail.

>> */
>> XLogRecPtr
>> log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
>> @@ -7305,11 +7305,13 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
>> XLogRegisterBuffer(0, vm_buffer, 0);
>>
>> flags = REGBUF_STANDARD;
>> + HOLD_INTERRUPTS();
>> if (!XLogHintBitIsNeeded())
>> flags |= REGBUF_NO_IMAGE;
>> XLogRegisterBuffer(1, heap_buffer, flags);
>>
>> recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_VISIBLE);
>> + RESUME_INTERRUPTS();
>
> This could maybe do with a comment on why the HOLD/RESUME_INTERRUPTS()
> is required here, similar as is done in bufpage.c further down.

Fixed.

>> diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
>> index 92cc7ea073..fa074c6046 100644
>> --- a/src/backend/access/rmgrdesc/xlogdesc.c
>> +++ b/src/backend/access/rmgrdesc/xlogdesc.c
>> @@ -18,6 +18,7 @@
>> #include "access/xlog.h"
>> #include "access/xlog_internal.h"
>> #include "catalog/pg_control.h"
>> +#include "storage/bufpage.h"
>> #include "utils/guc.h"
>> #include "utils/timestamp.h"
>>
>> @@ -140,6 +141,20 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
>> xlrec.ThisTimeLineID, xlrec.PrevTimeLineID,
>> timestamptz_to_str(xlrec.end_time));
>> }
>> + else if (info == XLOG_CHECKSUMS)
>> + {
>> + xl_checksum_state xlrec;
>> +
>> + memcpy(&xlrec, rec, sizeof(xl_checksum_state));
>> + if (xlrec.new_checksumtype == PG_DATA_CHECKSUM_VERSION)
>> + appendStringInfo(buf, "on");
>> + else if (xlrec.new_checksumtype == PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION)
>> + appendStringInfo(buf, "inprogress-off");
>> + else if (xlrec.new_checksumtype == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
>> + appendStringInfo(buf, "inprogress-on");
>> + else
>> + appendStringInfo(buf, "off");
>
> We probably discussed this earlier, but what was the conclusion?
> PG_DATA_CHECKSUM_VERSION = 1 sounds like somebody thought it a good idea
> to not just have a bool here but they probably rather thought about
> different checkumming/hashing algorithms/implentations than about
> internal state of the checksumming machinery.

Commit 443951748ce4c94b001877c7cf88b0ee969c79e7 explicitly moved from a bool to
be able to handle changes to the checksum field. I don't recall it being
discussed in the context of this patch.

> If we decide on v2 of data page checksums, how would that look like?
>
> PG_DATA_CHECKSUM_VERSION_V2 = 4?

Something like that yes.

> If we think we're done with versions, did we consider removing the
> VERSION here, because it is really confusing in
> PG_DATA_CHECKSUM_INPROGRESS_ON/OFF_VERSION, like
> PG_DATA_CHECKSUM_STATE_ON/INPROGRESS_ON/OFF? Or would removing "version"
> also from pg_controldata be a backwards-incompatible change we don't
> want to do?

We could rename the _INPROGRESS states to not have the _VERSION suffix, but
then we'd end up in a discussion around what a checksum version is, and we'd be
assigning something not named _VERSION to a version field . I think the
current state of the patch is the least surprising.

> Sorry again, I think we discussed it earlier, but maybe at least some
> comments about what VERSION is supposed to be/mean in bufpage.h would be
> in order:

That I don't disagree with, but it can be done separately from this work. I
didn't chase down the thread which led to 443951748ce4c94b but I assume the
discussion there would be useful for documenting this.

>> - doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);
>> + doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites || DataChecksumsOnInProgress());
>
> Indent, but I guess this will be indented by pg_indent after the fact
> andyway? Or is this how it's supposed to look?

The patch has been through pgindent so I think it's by design.

>> + * Are checksums enabled, or in the process of being enabled, for data pages?
>
> The second "," looks odd and could be ommitted?.

I think that was correct, but not being a native speaker I might be wrong.
Either way I've reworded the comment to make it clearer since it on the whole
seemed a bit cobbled together.

>> + * to keep the critical section short. This is in order to protect against
>> + * TOCTOU situations around checksum validation.
>
> I had to google "TOCTOU" and this acronym doesn't appear elsewhere in
> the source tree, so I suggest to spell it out at least here (there's one
> more occurance of it in this patch)

Fair enough, both places fixed. The comment in RelationBuildLocalRelation was
also reworded a bit to try and make a tad clearer.

>> + * DataChecksumsOnInProgress
>> + * Returns whether data checksums are being enabled
>> + *
>> + * Most operations don't need to worry about the "inprogress" states, and
>> + * should use DataChecksumsNeedVerify() or DataChecksumsNeedWrite(). The
>> + * "inprogress" state for enabling checksums is used when the checksum worker
>> + * is setting checksums on all pages, it can thus be used to check for aborted
>> + * checksum processing which need to be restarted.
>
> The first "inprogress" looks legit as it talks about both states, but
> the second one should be "inprogress-on" I think and ...

The sentence was as intended, but I agree that spelling out the state name is
better. Fixed.

>> + * DataChecksumsOffInProgress
>> + * Returns whether data checksums are being disabled
>> + *
>> + * The "inprogress" state for disabling checksums is used for when the worker
>> + * resets the catalog state. Operations should use DataChecksumsNeedVerify()
>> + * or DataChecksumsNeedWrite() for deciding whether to read/write checksums.
>
> ... "inprogress-off" here.

Same as the above, but also fixed (with some additional wordsmithing).

>> +void
>> +AbsorbChecksumsOnInProgressBarrier(void)
>> +{
>> + Assert(LocalDataChecksumVersion == 0 || LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
>> + LocalDataChecksumVersion = PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION;
>> +}
>
> Bikeshed alert: maybe alo those Absorb*Barrier functions could be lumped
> together after the SetDataChecksums*() functions. If not, a function
> comment would be in order.

I moved all the Absorb* functions into a single place since they very similar.

>> +void
>> +SetDataChecksumsOnInProgress(void)

> This function doesn't have the customary function header comment nor any
> comments in the body and it looks like it's doing some pretty important
> stuff, so I think some comments would be in order, e.g.
> explaining that "data_checksum_version != 0" means we've already got
> checksums enabled or are in the process of enabling/disabling them.

That was indeed sloppy, fixed.

>> +/*
>> + * SetDataChecksumsOn
>> + * Enables data checksums cluster-wide
>> + *
>> + * Enabling data checksums is performed using two barriers, the first one
>> + * sets the checksums state to "inprogress-on" and the second one to "on".
>> + * During "inprogress-on", checksums are written but not verified. When all
>> + * existing pages are guaranteed to have checksums, and all new pages will be
>> + * initiated with checksums, the state can be changed to "on".
>
> This should proabably go above SetDataChecksumsOnInProgress() because
> even though I've just reviewed this, I looked at the following function
> body and wondered where the second barrier went...

I kept the ordering but reworded to comment to make it clearer.

>> + * inprogress-off state during which backends continue to write checksums
>
> The inprogress-off is in " " everywhere else.

Fixed.

>> + else
>> + {
>> + /*
>> + * Ending up here implies that the checksums state is "inprogress-on"
>> + * and we can transition directly to "off" from there.
>
> Can you explain that a bit more? Is "inprogress-on" a typo for
> "inprogress-off", or do you really mean that we can just switch off
> checksums during "inprogress-on"? If so, the rationale should be
> explained a bit more.

The reason why we need "inprogress-off", where checksums are written but not
verified, is to ensure that backends still in the "on" state have checksums
updated to not incur verification failures on concurrent writes.

If the state is "inprogress-on" then checksums are being written but not
verified, so we can transition directly to "off" as there are no backends
verifying data checksums so there is to need to keep writing them.

>> @@ -7929,6 +8226,32 @@ StartupXLOG(void)
>> */
>> CompleteCommitTsInitialization();
>>
>> + /*
>> + * If we reach this point with checksums in progress state (either being
>> + * enabled or being disabled), we notify the user that they need to
>> + * manually restart the process to enable checksums. This is because we
>
> I think this could rephrased to "If we reach this point with checksums
> being enabled, we notify..." because the disable case is different and
> handled in the following block.

Correct, that sentence was a leftover from a previous version of this codepath,
fixed.

>> @@ -965,10 +965,13 @@ InsertPgClassTuple(Relation pg_class_desc,
>> /* relpartbound is set by updating this tuple, if necessary */
>> nulls[Anum_pg_class_relpartbound - 1] = true;
>>
>> + HOLD_INTERRUPTS();
>> + values[Anum_pg_class_relhaschecksums - 1] = BoolGetDatum(DataChecksumsNeedWrite());
>> tup = heap_form_tuple(RelationGetDescr(pg_class_desc), values, nulls);
>>
>> /* finally insert the new tuple, update the indexes, and clean up */
>> CatalogTupleInsert(pg_class_desc, tup);
>> + RESUME_INTERRUPTS();
>>
>> heap_freetuple(tup);
>
> Maybe add a comment here why we now HOLD/RESUME_INTERRUPTS.

Fixed.

>> + * a lot of WAL as the entire database is read and written. Once all datapages
>
> It's "data page" with a space in between everywhere else.

Fixed.

>> + * - Backends SHALL NOT violate local datachecksum state
>> + * - Data checksums SHALL NOT be considered enabled cluster-wide until all
>
> Linewrap.

Not sure what you mean here, sentence is too long to fit on one line.

>> + * When Enabling Data Checksums
>> + * ----------------------------
>
> There's something off with the indentation of either the title or the
> line seperator here.

There was a mix of tab and space, fixed to consistently use spaces for
indentation here.

>> + * When the DataChecksumsWorker has finished writing checksums on all pages
>> + * and enable data checksums cluster-wide, there are three sets of backends:
>
> "enables"

Fixed.

>> + * Bg: Backend updating the global state and emitting the procsignalbarrier
>> + * Bd: Backends on "off" state
>
> s/on/in/
>
> Also, given that "When the DataChecksumsWorker has finished writing
> checksums on all pages and enable[s] data checksums cluster-wide",
> shouldn't that mean that all other backends are either in "on" or
> "inprogress-on" state, because the Bd -> Bi transition happened during a
> previous barrier? Maybe that should be first explained?

Right, I've reworded this and wordsmithed a bit to make it a bit less
convoluted.

>> + * Be: Backends in "on" state
>> + * Bi: Backends in "inprogress-on" state
>> + *
>> + * Backends transition from the Bd state to Be like so: Bd -> Bi -> Be
>> + *
>> + * Backends in Bi and Be will write checksums when modifying a page, but only
>> + * backends in Be will verify the checksum during reading. The Bg backend is
>> + * blocked waiting for all backends in Bi to process interrupts and move to
>> + * Be. Any backend starting will observe the global state being "on" and will
>
> "Any backend starting while Bg is waiting for the barrier" right?

Correct, fixed.

>> + * All sets are compatible while still operating based on
>> + * their local state.
>
> Whoa, you lost me there.

What I meant was that Bi and Be are compatible as they both satisfy the
requirement of each other (both write checksums), so they can concurrently
exist in a cluster without false negatives occurring. Reworded.

>> + * When Disabling Data Checksums
>> + * -----------------------------
>> + * A process which fails to observe data checksums being disabled can induce
>> + * two types of errors: writing the checksum when modifying the page and
>
> Can you rephrase what you mean with "being disabled"? If you mean we're
> in the "inprogress-off" state, then why is that an error? Do you mean
> "writing *no* checksum" because AIUI we should still write checksums at
> this point? Or are you talking about a different state?

This was referring to the "off" state, but it wasn't terribly clear. I've
reworded that part.

>> + * validating a data checksum which is no longer correct due to modifications
>> + * to the page.
>> + *
>> + * Bg: Backend updating the global state and emitting the procsignalbarrier
>> + * Bd: Backands in "off" state
>
> s/Backands/Backends/

Fixed.

>> + * Be: Backends in "on" state
>> + * Bi: Backends in "inprogress-off" state
>
> I suggest using a different symbol here for "inprogress-off" in order
> not to confuse the two (different right?) Bi.

Good point, fixed.

>> + * Backends transition from the Be state to Bd like so: Be -> Bi -> Bd
>> + *
>> + * The goal is to transition all backends to Bd making the others empty sets.
>> + * Backends in Bi writes data checksums, but don't validate them, such that
>
> s/writes/write/

Fixed.

>> + * backends still in Be can continue to validate pages until the barrier has
>> + * been absorbed such that they are in Bi. Once all backends are in Bi, the
>> + * barrier to transition to "off" can be raised and all backends can safely
>> + * stop writing data checksums as no backend is enforcing data checksum
>> + * validation.
>
> ... "anymore" maybe.

Makes sense, fixed.

>> + * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
>
> Copyright year bump, there's two other occurances.

Fixed.

>> + bool target;
>
> This "target" bool isn't documented very well (at least here). AIUI,
> it's true if we enable checksums and false if we disable them?

Correct, and I agree that the name is quite poor. I renamed to
"enable_checksums" and added a comment.

>> +void
>> +StartDatachecksumsWorkerLauncher(bool enable_checksums, int cost_delay, int cost_limit)
>
> After reading through the function I find this 'bool enable_checksums'
> a bit confusing, I think something like 'int operation' and then
> comparing it to either ENABLE or DISABLE or whatever would make the code
> more readable, but it's a minor nitpick.

Maybe, I personally find enable_checksums to be self-explanatory but I'm
clearly biased. With the small change of renaming "target", do you still think
it should be changed?

>> + * running launcher.
>> + */
>> +
>
> extra newline

Fixed.

>> + if (DatachecksumsWorkerShmem->target)
>
> Would "if (DatachecksumsWorkerShmem->target == enable_checksums)" maybe
> be better, or does that change the meaning?

It would change its meaning as it would always be true. Since renaming target
to enable_checksums I think this condition is quite clear though.

>> + /*
>> + * The launcher is currently not running, so we need to query the system
>> + * data checksum state to determine how to proceed based on the requested
>> + * target state.
>> + */
>> + else
>> + {
>
> That's a slightly weird comment placement, maybe put it below the '{' so
> that that the '} else {' is kept intact?

This style is used in a number of places in the patch, and it's used around the
tree as well. If it's a common concern/complaint then I'm happy to change them
all.

>> + memset(DatachecksumsWorkerShmem->operations, 0, sizeof(DatachecksumsWorkerShmem->operations));
>> + DatachecksumsWorkerShmem->target = enable_checksums;
>
> This one is especially confusing as it looks like we're unconditionally
> enabling checksums but instead we're just setting the target based on
> the bool (see above). It might be our standard notation though.

After the rename of the variable to enable_checkums it reads pretty clear that
this operation is caching the passed in value. What do you think about the
codepath now?

>> + /*
>> + * If the postmaster crashed we cannot end up with a processed database so
>> + * we have no alternative other than exiting. When enabling checksums we
>> + * won't at this time have changed the pg_control version to enabled so
>> + * when the cluster comes back up processing will have to be resumed. When
>> + * disabling, the pg_control version will be set to off before this so
>> + * when the cluster comes up checksums will be off as expected. In the
>> + * latter case we might have stale relhaschecksums flags in pg_class which
>> + * need to be handled in some way. TODO
>
> Any idea on how? Or is that for a future feature and can just be documented
> for now? In any case, one can just run pg_disable_checksums() again as
> mentioned elsewhere so maybe just rephrase the comment to say the admin
> needs to do that?

I've reworded the comment to be a nice to have rather than a need to have,
since stale flags have no practical implication. Regarding how to address it
I'm not really sure what the cleanest way would be, and I've deliberately held
off on it since it's mostly cosmetical and this patch is complicated enough as
it is.

>> + /*
>> + * Set up local cache of Controldata values.
>> + */
>> + InitLocalControldata();
>
> This just sets LocalDataChecksumVersion for now, is it expected to cache
> other ControlData values in the future? Maybe clarifying the current
> state in the comment would be in order.

I'm a bit hesitant to write in the comment that it's only the data checksums
state right now, since such a comment is almost guaranteed to be missed and
become stale when/if InitLocalControldata gains more capabilities. Instead
I've added a function comment in xlog.c on InitLocalControldata.

>> -static bool data_checksums;
>> +static int data_checksums_tmp;
>
> Why the _tmp, is that required for enum GUCs?

That is a missed leftover from an earlier version, removed.

Attached is a rebase with the above fixes, thanks for review!

cheers ./daniel

Attachment Content-Type Size
online_checksums27.patch application/octet-stream 130.9 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2021-01-07 14:05:44 Re: Online checksums patch - once again
Previous Message Dean Rasheed 2021-01-07 13:49:50 Re: PoC/WIP: Extended statistics on expressions