Re: Online checksums patch - once again

From: Daniel Gustafsson <daniel(at)yesql(dot)se>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online checksums patch - once again
Date: 2020-09-02 12:22:25
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On 29 Jul 2020, at 19:58, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Jun 22, 2020 at 8:27 AM Daniel Gustafsson <daniel(at)yesql(dot)se> wrote:
>> Attached is a new version of the online checksums patch which, I hope, address
>> most of the concerns raised in previous reviews. There has been a fair amount
>> of fiddling done, so below is a summary of what has been done.
> Here are a bunch of comments based on a partial read-through of this
> patch. The most serious concerns, around synchronization, are down
> toward at the bottom. Sorry this is a bit eclectic as a review, but I
> wrote things down as I read through the patch more or less in the
> order I ran across them.

Not need to apologize, many thanks for the review! This is a partial response,
since I need to spend a bit more time to properly respond to the
synchronization questions, but I also wanted to submit a new version which
applies in the CF patch tester. Anything not addressed here will be be in a
follow-up version.

The attached v20 contains fixes from this review as well as Justin's review

> Regarding disable_data_checksums(), I disagree with ereport(LOG, ...)
> here. If you want to indicate to the caller whether or not a state
> change occurred, you could consider returning a Boolean instead of
> void. If you want to do it with log messages, I vote for NOTICE, not
> LOG. Maybe NOTICE is better, because enable_data_checksums() seems to
> want to convey more information that you can represent in a Boolean,
> but then it should use NOTICE consistently, not a mix of NOTICE and
> LOG.

I agree with this, I've moved to returning a bool rather than ereporting NOTICE
(or LOG).

> Formatting needs work for project style: typically no braces around
> single statements, "ereport(WHATEVER," should always have a line break
> at that point.

I think I've fixed all these instances, and the attached patch have been run
through pgindent as well.

> + * cluster, which was not initialized with checksums, this worker will ensure
> "which was not initialized with checksums" => "that does not running
> with checksums enabled"?

Fixed, but it's "run" rather than "running" right?

> + * turned on. In the case of disabling checksums, the state transition is
> + * recorded in the catalog and controlfile, no changes are performed
> + * on the data pages or in the catalog.
> Comma splice. Either write "controlfile; no" or "controlfile, and no".


> My spell-checker complains that controfile, clusterwide, inprogress,
> and endstate are not words. I think you should think about inserting
> spaces or, in the case of cluster-wide, a dash, unless they are being
> used as literals, in which case perhaps those instances should be
> quoted. "havent" needs an apostrophe.

This is me writing Swedish in English. Fixed.

> + * DataChecksumsWorker will compile a list of databases which exists at the
> which exist


> + * For each database, all relations which have storage are read and every data
> + * page is marked dirty to force a write with the checksum, this will generate
> Comma splice. Split into two sentences.


> + * In case checksums have been enabled and later disabled, when re-enabling
> + * pg_class.relhaschecksums will be reset to false before entering inprogress
> + * mode to ensure that all relations are re-processed.
> "If checksums are enabled, then disabled, and then re-enabled, every
> relation's pg_class.relhaschecksums field will be reset to false
> before entering the in-progress mode."

Replaced with your version, thanks.

> + * Disabling checksums is done as an immediate operation as it only updates
> s/done as //


> + * to pg_class.relhaschecksums is performed as it only tracks state during
> is performed -> are necessary


> + * Access to other members can be done without a lock, as while they are
> + * in shared memory, they are never concurrently accessed. When a worker
> + * is running, the launcher is only waiting for that worker to finish.
> The way this is written, it sounds like you're saying that concurrent
> access might be possible when this structure isn't in shared memory.
> But since it's called DatachecksumsWorkerShmemStruct that's not likely
> a correct conclusion, so I think it needs rephrasing.

Right, that was a pretty poorly worded comment. Rewritten and expanded upon.

> + if (DatachecksumsWorkerShmem->launcher_started &&
> !DatachecksumsWorkerShmem->abort)
> + started = true;
> Why not started = a && b instead of started = false; if (a && b) started = true?

I don't have strong feelings either format, so changed according to your suggestion.

> + {
> + LWLockRelease(DatachecksumsWorkerLock);
> + ereport(ERROR,
> + (errmsg("data checksums worker has been aborted")));
> + }
> Errors always release LWLocks, so this seems unnecessary. Also, the
> error message looks confusing from a user perspective. What does it
> mean if I ask you to make me a cheeseburger and you tell me the
> cheeseburger has been eaten? I'm asking for a *new* cheeseburger (or
> in this case, a new worker).

Now you made me hungry for a green chili cheeseburger..

This case covers when the user disables a running datachecksumsworker and
then enables it again before the worker has finished the current page and thus
observed the abort request.

If the worker learns to distinguish between a user abort request and and an
internal cancellation (due to WL_POSTMASTER_DEATH) this window could be handled
by clearing the user request and keep going. It would be the same thing as
killing the worker and restarting, except fewer moving parts. Either way, I
agree that it's a confusing error path, and one which should be addressed.

> I wonder why this thing is inventing a brand new way of aborting a
> worker, anyway. Why not just keep track of the PID and send it SIGINT
> and have it use CHECK_FOR_INTERRUPTS()? That's already sprinkled all
> over the code, so it's likely to work better than some brand-new
> mechanism that will probably have checks in a lot fewer places.

I'm not convinced that the current coding is less responsive, and signalling
for launcher/worker isn't entirely straightforward, but I agree that it's
better to stick to established patterns. Will rewrite to use pqsignal/SIGINT
and will address the previous paragraph in that as well.

> + vacuum_delay_point();
> Huh? Why?

The datachecksumsworker is using the same machinery for throttling as the
cost-based vacuum delay, hence this call. Do you object to using that same
machinery, or the implementation and/or documentation of it?

> + elog(DEBUG2,
> + "background worker \"datachecksumsworker\" starting to process relation %u",
> + relationId);
> This and similar messages seem likely they refer needlessly to
> internals, e.g. this could be "adding checksums to relation with OID
> %u" without needing to reference background workers or
> datachecksumworker.

I have reworded these as well as removed a few that seemed a bit uninteresting.

> It would be even better if we could find a way to
> report relation names.

True, but doesn't really seem worth the overhead for a debug log.

> + * so when the cluster comes back up processing will habe to be resumed.
> habe -> have

Fixed (also noted by Justin upthread).

> + ereport(FATAL,
> + (errmsg("cannot enable checksums without the postmaster process"),
> + errhint("Restart the database and restart the checksumming process
> by calling pg_enable_data_checksums().")));
> I understand the motivation for this design and it may be the best we
> can do, but honestly it kinda sucks. It would be nice if the system
> itself figured out whether or not the worker should be running and, if
> yes, ran it. Like, if we're in this state when we exit recovery (or
> decide to skip recovery), just register the worker then automatically.
> Now that could still fail for lack of slots, so I guess to make this
> really robust we'd need a way for the registration to get retried,
> e.g. autovacuum could try to reregister it periodically, and we could
> just blow off the case where autovacuum=off. I don't know. I'd rather
> avoid burdening users with an implementation detail if we can get
> there, or at least minimize what they need to worry about.

I don't disagree with you, but I don't see how an automatic restart could be
made safe/good enough to be worth the complexity, since it's nigh impossible to
make it always Just Work. Exposing implementation details to users is clearly
not a good design choice, if it can be avoided.

> + snprintf(activity, sizeof(activity) - 1,
> + "Waiting for worker in database %s (pid %d)", db->dbname, pid);
> + pgstat_report_activity(STATE_RUNNING, activity);
> So we only know how to run one such worker at a time?

Correct, there is currently one worker at a time.

> Maybe WaitForAllTransactionsToFinish should advertise something in
> pg_stat_activity.

It currently does this:

"Waiting for current transactions to finish (waiting for %u)",
pgstat_report_activity(STATE_RUNNING, activity);

Did you have anything else in mind?

> I think you should try to give all of the functions header comments,
> or at least all the bigger ones.

I've done a first pass over the patch.

> + /* Abort flag set, so exit the whole process */
> + return false;
> I'd put braces here.


> And also, why bail out like this instead of
> retrying periodically until we succeed?

Currently, the abort request can come from the user disabling data checksums
during processing, postmaster dying or SIGINT. Neither of these cases qualify
for retrying in the current design.

> + * True if all data pages of the relation have data checksums.
> Not fully accurate, right?

Ugh.. thats a leftover from previous hacking that I've since ripped out.
Sorry about that, it's been removed.

> + /*
> + * Force a checkpoint to get everything out to disk. TODO: we probably
> + * don't want to use a CHECKPOINT_IMMEDIATE here but it's very convenient
> + * for testing until the patch is fully baked, as it may otherwise make
> + * tests take a lot longer.
> + */
> Do we need to verify that the checkpoint succeeded before we can
> declare victory and officially change state?

I don't think we need a verification step here. With CHECKPOINT_WAIT we are
blocking until the checkpoint has completed. If it fails we won't enable data
checksums until the process has been restarted and a subsequent checkpoint

> I don't think it's a good idea to have three separate types of barrier
> here. I think you should just have a barrier indicating that the state
> has changed, and then backends need to reread the state from shared
> memory when they absorb the barrier.

I'm not sure I follow why, my understanding of the infrastructure was to make
it finegrained like this. But, you have clearly spent more time thinking about
this functionality so I'm curious to learn. Can you elaborate on your

> But the bigger problem here, and the thing that makes me intensely
> doubtful that the synchronization in this patch is actually correct,
> is that I can't find any machinery in the patch guarding against
> TOCTTOU issues, nor any comments explaining why I shouldn't be afraid
> of them.


I unfortunately haven't had time to read the READ ONLY patch so I can't comment
on how these two patches do things in relation to each other.

The main synchronization mechanisms are the use of the inprogress mode where
data checksums are written but not verified, and by waiting for all
pre-existing non-compatible processes (transactions, temp tables) to disappear
before enabling.

That being handwavily said, I've started to write down a matrix with classes of
possible synchronization bugs and how the patch handles them in order to
properly respond.

cheers ./daniel

Attachment Content-Type Size
online_checksums20.patch application/octet-stream 99.9 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2020-09-02 13:47:52 Re: Kerberos support broken on MSVC builds for Windows x64?
Previous Message Andrey M. Borodin 2020-09-02 12:18:09 Re: [PATCH] Covering SPGiST index