Re: Online checksums patch - once again

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Online checksums patch - once again
Date: 2020-11-23 17:36:06
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 17/11/2020 10:56, Daniel Gustafsson wrote:
>> On 5 Oct 2020, at 13:36, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> 2. The signaling between enable_data_checksums() and the launcher process looks funny to me. The general idea seems to be that enable_data_checksums() just starts the launcher process, and the launcher process figures out what it need to do and makes all the changes to the global state. But then there's this violation of the idea: enable_data_checksums() checks DataChecksumsOnInProgress(), and tells the launcher process whether it should continue a previously crashed operation or start from scratch. I think it would be much cleaner if the launcher process figured that out itself, and enable_data_checksums() would just tell the launcher what the target state is.
>> enable_data_checksums() and disable_data_checksums() seem prone to race conditions. If you call enable_data_checksums() in two backends concurrently, depending on the timing, there are two possible outcomes:
>> a) One call returns true, and launches the background process. The other call returns false.
>> b) Both calls return true, but one of them emits a "NOTICE: data checksums worker is already running".
>> In disable_data_checksum() imagine what happens if another backend calls enable_data_checksums() in between the ShutdownDatachecksumsWorkerIfRunning() and SetDataChecksumsOff() calls.
> I've reworked this in the attached such that the enable_ and disable_ functions
> merely call into the launcher with the desired outcome, and the launcher is
> responsible for figuring out the rest. The datachecksumworker is now the sole
> place which initiates a state transfer.

Well, you still fill the DatachecksumsWorkerShmem->operations array in
the backend process that launches the datacheckumworker, not in the
worker process. I find that still a bit surprising, but I believe it works.

>>> /*
>>> * Mark the buffer as dirty and force a full page write. We have to
>>> * re-write the page to WAL even if the checksum hasn't changed,
>>> * because if there is a replica it might have a slightly different
>>> * version of the page with an invalid checksum, caused by unlogged
>>> * changes (e.g. hintbits) on the master happening while checksums
>>> * were off. This can happen if there was a valid checksum on the page
>>> * at one point in the past, so only when checksums are first on, then
>>> * off, and then turned on again. Iff wal_level is set to "minimal",
>>> * this could be avoided iff the checksum is calculated to be correct.
>>> */
>>> MarkBufferDirty(buf);
>>> log_newpage_buffer(buf, false);
>> It's really unfortunate that we have to dirty the page even if the checksum already happens to match. Could we only do the log_newpage_buffer() call and skip MarkBufferDirty() in that case?
> I think we can, but I've intentionally stayed away from such optimizations
> until the basic version of the patch was deemed safe and approaching done.
> It's complicated enough as it is IMO.

Fair enough.

> The attached fixes the above plus a few other small things found while hacking
> on this version.

I haven't read through the whole patch, but a few random comments below,
in no particular order:

pg_enable/disable_data_checksums() should perform a superuser-check. I
don't think we want to expose them to any users.

> +/*
> + * Local state for Controlfile data_checksum_version. After initialization,
> + * this is only updated when absorbing a procsignal barrier during interrupt
> + * processing. Thus, it can be read by backends without the need for a lock.
> + * Possible values are the checksum versions defined in storage/bufpage.h and
> + * zero for when checksums are disabled.
> + */
> +static uint32 LocalDataChecksumVersion = 0;

The comment is a bit confusing: "Thus, it can be read by backends
without the need for a lock". Since it's a variable in backend-private
memory, it can only be read by the same backend, not "backends". And
that's also why you don't need a lock, not because it's updated during
interrupt processing. I understand how this works, but maybe rephrase
the comment a bit.

> +/*
> + * DataChecksumsOnInProgress
> + * Returns whether data checksums are being enabled
> + *
> + * Most callsites shouldn't need to worry about the "inprogress" states, since
> + * they should check the requirement for verification or writing. Some low-
> + * level callsites dealing with page writes need however to know. It's also
> + * used to check for aborted checksum processing which need to be restarted.
> */
> bool
> -DataChecksumsEnabled(void)
> +DataChecksumsOnInProgress(void)
> +{
> + return (LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
> +}

s/need/needs/. The whole paragraph feels a bit awkwardly worded in
general. I'd suggest something like: "Most operations don't need to
worry about the "inprogress" states, and should use
DataChecksumsNeedVerify() or DataChecksumsNeedWrite()".
DataChecksumsOffInProgress() is called from
StartDatachecksumsWorkerLauncher(), which I wouldn't call a "low-level

> @@ -1079,7 +1091,7 @@ XLogInsertRecord(XLogRecData *rdata,
> Assert(RedoRecPtr < Insert->RedoRecPtr);
> RedoRecPtr = Insert->RedoRecPtr;
> }
> - doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);
> + doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites || DataChecksumsOnInProgress());
> if (doPageWrites &&
> (!prevDoPageWrites ||

The comment above this deserves to be updated. Also, this is a very hot
codepath; will this slow down WAL-logging, when full-page writes are
disabled? Could we inline DataChecksumsOnInProgress() or set
Insert->forcePageWrites when checksums are being computed or something?

In StartupXLOG():
> + /*
> + * If data checksums were being disabled when the cluster was shutdown, we
> + * know that we have a state where all backends have stopped validating
> + * checksums and we can move to off instead.
> + */
> + if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION)
> + {
> + LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
> + ControlFile->data_checksum_version = 0;
> + LWLockRelease(ControlFileLock);
> + }
> +

Should this be WAL-logged, like in SetDataChecksumsOff()?

In SetDataChecksumsOff():
> + ControlFile->data_checksum_version = 0;
> + UpdateControlFile();
> + LWLockRelease(ControlFileLock);
> +
> + XlogChecksums(0);
> + WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_OFF));
> +}

What happens is if you crash between UpdateControlFile() and XlogChecksum()?

- Heikki

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2020-11-23 17:44:57 Re: truncating timestamps on arbitrary intervals
Previous Message Tom Lane 2020-11-23 16:39:29 Re: Bogus documentation for bogus geometric operators