Re: Online checksums patch - once again

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>, Michael Banck <michael(dot)banck(at)credativ(dot)de>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online checksums patch - once again
Date: 2021-01-22 11:55:55
Message-ID: f7ab3338-e3e9-1a3b-4f0a-b93cb1c96a8b@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I read through the latest patch,
v31-0001-Support-checksum-enable-disable-in-a-running-clu.patch. Some
comments below:

On 19/01/2021 14:32, Daniel Gustafsson wrote:
> + /*
> + * Hold interrupts for the duration of xlogging to avoid the state of data
> + * checksums changing during the processing which would later the premise
> + * for xlogging hint bits.
> + */

Sentence sense does not make.

> @@ -904,6 +916,7 @@ static void SetLatestXTime(TimestampTz xtime);
> static void SetCurrentChunkStartTime(TimestampTz xtime);
> static void CheckRequiredParameterValues(void);
> static void XLogReportParameters(void);
> +static void XlogChecksums(ChecksumType new_type);
> static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI,
> TimeLineID prevTLI);
> static void LocalSetXLogInsertAllowed(void);

Spelling: make it "XLogChecksums" for consistency.

> /*
> * DataChecksumsNeedWrite
> * Returns whether data checksums must be written or not
> *
> * Returns true iff data checksums are enabled or are in the process of being
> * enabled. In case data checksums are currently being enabled we must write
> * the checksum even though it's not verified during this stage. Interrupts
> * need to be held off by the caller to ensure that the returned state is
> * valid for the duration of the intended processing.
> */
> bool
> DataChecksumsNeedWrite(void)
> {
> Assert(InterruptHoldoffCount > 0);
> return (LocalDataChecksumVersion == PG_DATA_CHECKSUM_VERSION ||
> LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION ||
> LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION);
> }

Can you be more precise on the "duration of the intended processing"? It
means, until you have actually written out the page, or something like
that, right? The similar explanation in DataChecksumsNeedVerify() is
easier to understand. The two functions are similar, it would be good to
phrase the comments similarly, so that you can quickly see the
difference between the two.

> /*
> * SetDataChecksumsOnInProgress
> * Sets the data checksum state to "inprogress-on" to enable checksums
> *
> * In order to start the process of enabling data checksums in a running
> * cluster the data_checksum_version state must be changed to "inprogress-on".
> * This state requires data checksums to be written but not verified. The state
> * transition is performed in a critical section in order to provide crash
> * safety, and checkpoints are held off. When the emitted procsignalbarrier
> * has been absorbed by all backends we know that the cluster has started to
> * enable data checksums.
> */

The two "in order" are superfluous, it's more concise to say just "To
start the process ..." (I was made aware of that by Jürgen Purtz's
recent patches that removed a couple of "in order"s from the docs as
unnecessary).

It's a bit confusing to talk about the critical section and
procsignalbarrier here. Does the caller need to wait for the
procsignalbarrier? No, that's just explaining what the function does
internally. Maybe move that explanation inside the function, and say
here something like "This function blocks until all processes have
acknowledged the state change" or something like that.

> /*
> * SetDataChecksumsOn
> * Enables data checksums cluster-wide
> *
> * Enabling data checksums is performed using two barriers, the first one
> * sets the checksums state to "inprogress-on" (which is performed by
> * SetDataChecksumsOnInProgress()) and the second one to "on" (performed here).
> * During "inprogress-on", checksums are written but not verified. When all
> * existing pages are guaranteed to have checksums, and all new pages will be
> * initiated with checksums, the state can be changed to "on".
> */

Perhaps the explanation for how these SetDataChecksumsOn() and
SetDataChecksumsOnInProgress() functions work together should be moved
to one place. For example, explain both functions here at
SetDataChecksumsOn(), and just have a "see SetDataChecksumsOn()" in the
other function, or no comment at all if they're kept next to each other.
As it stands, you have to read both comments and piece together the the
big picture in your head. Maybe add a "see datachecksumsworker.c" here,
since there's a longer explanation of the overall mechanism there.

> +/*
> + * Disables checksums for the cluster, unless already disabled.
> + *
> + * Has immediate effect - the checksums are set to off right away.
> + */
> +Datum
> +disable_data_checksums(PG_FUNCTION_ARGS)
> +{
> + if (!superuser())
> + ereport(ERROR,
> + (errmsg("must be superuser")));
> +
> + StartDatachecksumsWorkerLauncher(false, 0, 0);
> +
> + PG_RETURN_BOOL(true);
> +}

The documentation says "Returns <literal>false</literal> in case data
checksums are disabled already", but the function always returns true.
The enable_data_checksum() function also returns a constant 'true'; why
not make it void?

The "has immediate effect" comment seems wrong, given that it actually
launches a worker process.

> + /*
> + * If the launcher is already started, the only operation we can perform
> + * is to cancel it iff the user requested for checksums to be disabled.
> + * That doesn't however mean that all other cases yield an error, as some
> + * might be perfectly benevolent.
> + */

This comment is a bit hard to understand. Maybe something like

"If the launcher is already started, we cannot launch a new one. But if
the user requested for checksums to be disabled, we can cancel it."

> + if (DatachecksumsWorkerShmem->launcher_started)
> + {
> + if (DatachecksumsWorkerShmem->abort)
> + {
> + ereport(NOTICE,
> + (errmsg("data checksum processing is concurrently being aborted, please retry")));
> +
> + LWLockRelease(DatachecksumsWorkerLock);
> + return;
> + }

If it's being aborted, and the user requested to disable checksum, can
we leave out the NOTICE? I guess not, because the worker process won't
check the 'abort' flag, if it's already finished processing all data pages.

> + /*
> + * If the launcher is started data checksums cannot be on or off, but
> + * it may be in an inprogress state. Since the state transition may
> + * not have happened yet (in case of rapidly initiated checksum enable
> + * calls for example) we inspect the target state of the currently
> + * running launcher.
> + */

This comment contradicts itself. If the state transition has not
happened yet, then contrary to the first sentence, data checksums *are*
currently on or off.

> + if (enable_checksums)
> + {
> + /*
> + * If we are asked to enable checksums when they are already being
> + * enabled, there is nothing to do so exit.
> + */
> + if (DatachecksumsWorkerShmem->enable_checksums)
> + {
> + LWLockRelease(DatachecksumsWorkerLock);
> + return;
> + }
> +
> + /*
> + * Disabling checksums is likely to be a very quick operation in
> + * many cases so trying to abort it to save the checksums would
> + * run the risk of race conditions.
> + */
> + else
> + {
> + ereport(NOTICE,
> + (errmsg("data checksums are concurrently being disabled, please retry")));
> +
> + LWLockRelease(DatachecksumsWorkerLock);
> + return;
> + }
> +
> + /* This should be unreachable */
> + Assert(false);
> + }
> + else if (!enable_checksums)
> + {
> + /*
> + * Data checksums are already being disabled, exit silently.
> + */
> + if (DataChecksumsOffInProgress())
> + {
> + LWLockRelease(DatachecksumsWorkerLock);
> + return;
> + }
> +
> + DatachecksumsWorkerShmem->abort = true;
> + LWLockRelease(DatachecksumsWorkerLock);
> + return;
> + }

The Assert seems unnecessary. The "if (!enable_checkums)" also seems
unnecessary, could be just "else".

> + /*
> + * The launcher is currently not running, so we need to query the system
> + * data checksum state to determine how to proceed based on the requested
> + * target state.
> + */

"query the system" makes me think "checking the catalogs" or similar,
but this just looks at a few variables in shared memory.

> +/*
> + * ShutdownDatachecksumsWorkerIfRunning
> + * Request shutdown of the datachecksumsworker
> + *
> + * This does not turn off processing immediately, it signals the checksum
> + * process to end when done with the current block.
> + */
> +void
> +ShutdownDatachecksumsWorkerIfRunning(void)
> +{
> + LWLockAcquire(DatachecksumsWorkerLock, LW_EXCLUSIVE);
> +
> + /* If the launcher isn't started, there is nothing to shut down */
> + if (DatachecksumsWorkerShmem->launcher_started)
> + DatachecksumsWorkerShmem->abort = true;
> +
> + LWLockRelease(DatachecksumsWorkerLock);
> +}

This function is unused.

> +/*
> + * launcher_cancel_handler
> + *
> + * Internal routine for reacting to SIGINT and flagging the worker to abort.
> + * The worker won't be interrupted immediately but will check for abort flag
> + * between each block in a relation.
> + */
> +static void
> +launcher_cancel_handler(SIGNAL_ARGS)
> +{
> + LWLockAcquire(DatachecksumsWorkerLock, LW_EXCLUSIVE);
> + DatachecksumsWorkerShmem->abort = true;
> + LWLockRelease(DatachecksumsWorkerLock);
> +}

Acquiring an lwlock in signal handler is not safe.

Since this is a signal handler, it might still get called after the
process has finished processing, and has already set
DatachecksumsWorkerShmem->launcher_started = false. That seems like an
unexpected state.

> +/*
> + * WaitForAllTransactionsToFinish
> + * Blocks awaiting all current transactions to finish
> + *
> + * Returns when all transactions which are active at the call of the function
> + * have ended, or if the postmaster dies while waiting. If the postmaster dies
> + * the abort flag will be set to indicate that the caller of this shouldn't
> + * proceed.
> + */

The caller of this function doesn't check the 'abort' flag AFAICS. I
think you could just use use WL_EXIT_ON_PM_DEATH to error out on
postmaster death.

> + while (true)
> + {
> + int processed_databases = 0;
> +
> + /*
> + * Get a list of all databases to process. This may include databases
> + * that were created during our runtime.
> + *
> + * Since a database can be created as a copy of any other database
> + * (which may not have existed in our last run), we have to repeat
> + * this loop until no new databases show up in the list. Since we wait
> + * for all pre-existing transactions finish, this way we can be
> + * certain that there are no databases left without checksums.
> + */
> + DatabaseList = BuildDatabaseList();

I think it's unnecessary to wait out the transactions on the first
iteration of this loop. There must be at least one database, so there
will always be at least two iterations.

I think it would be more clear to move the
WaitForAllTransactionsToFinish() from BuildDatabaseList() to the callers.

> @@ -3567,6 +3571,27 @@ RelationBuildLocalRelation(const char *relname,
> relkind == RELKIND_MATVIEW)
> RelationInitTableAccessMethod(rel);
>
> + /*
> + * Set the data checksum state. Since the data checksum state can change at
> + * any time, the fetched value might be out of date by the time the
> + * relation is built. DataChecksumsNeedWrite returns true when data
> + * checksums are: enabled; are in the process of being enabled (state:
> + * "inprogress-on"); are in the process of being disabled (state:
> + * "inprogress-off"). Since relhaschecksums is only used to track progress
> + * when data checksums are being enabled, and going from disabled to
> + * enabled will clear relhaschecksums before starting, it is safe to use
> + * this value for a concurrent state transition to off.
> + *
> + * If DataChecksumsNeedWrite returns false, and is concurrently changed to
> + * true then that implies that checksums are being enabled. Worst case,
> + * this will lead to the relation being processed for checksums even though
> + * each page written will have them already. Performing this last shortens
> + * the window, but doesn't avoid it.
> + */
> + HOLD_INTERRUPTS();
> + rel->rd_rel->relhaschecksums = DataChecksumsNeedWrite();
> + RESUME_INTERRUPTS();
> +
> /*
> * Okay to insert into the relcache hash table.
> *

I grepped for relhashcheckums, and concluded that the value in the
relcache isn't actually used for anything. Not so! In
heap_create_with_catalog(), the actual pg_class row is constructed from
the relcache entry, so the value set in RelationBuildLocalRelation()
finds its way to pg_class. Perhaps it would be more clear to pass
relhachecksums directly as an argument to AddNewRelationTuple(). That
way, the value in the relcache would be truly never used.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Magnus Hagander 2021-01-22 11:56:10 mkid reference
Previous Message Greg Nancarrow 2021-01-22 11:49:24 Re: Parallel INSERT (INTO ... SELECT ...)