Re: Changing the state of data checksums in a running cluster

From: Daniel Gustafsson <daniel(at)yesql(dot)se>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Andres Freund <andres(at)anarazel(dot)de>, Bernd Helmle <mailings(at)oopsware(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Changing the state of data checksums in a running cluster
Date: 2026-03-17 22:01:13
Message-ID: BE93475D-504E-468E-867F-883749092DB9@yesql.se
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On 17 Mar 2026, at 12:45, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> Here's a quick, non-exhaustive review of v20260316 of the patch. I haven't followed the thread, so forgive me if some of these things have already been discussed.
>
> I don't see any major issues, just a bunch of small things:

Thanks you so much for looking!

>> +/*
>> + * Checksum version 0 is used for when data checksums are disabled (OFF).
>> + * PG_DATA_CHECKSUM_VERSION defines that data checksums are enabled in the
>> + * cluster and PG_DATA_CHECKSUM_INPROGRESS_{ON|OFF}_VERSION defines that data
>> + * checksums are either currently being enabled or disabled.
>> + */
>> +typedef enum ChecksumType
>> +{
>> + PG_DATA_CHECKSUM_OFF = 0,
>> + PG_DATA_CHECKSUM_VERSION,
>> + PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION,
>> + PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION
>> +} ChecksumType;
>
> Naming: This is called "VERSION", but also "Type", and I'm a little confused. I think the field was called "version" because we envisioned that we might want to use a different checksum algorithm in the future, which could then be indicated by a different checksum version number. If we do that in the future, how will these values look like?

This is an old leftover the patch has carried for too long, the inprogress
states were named _VERSION simply due to the ON state being so. I've renamed
them to just INPROGRESS_{ON|OFF} and made sure to place the ON state as the
last element in the enum. If we add other checksum algorithms in the future
there are different naming schemes we could use depending on which level of
implementation abstraction we want. We can either append a version number (and
consider the current ON state to be version 1 or 0), or an algorithm name (with
the current ON state a defuault algorithm).

* PG_DATA_CHECKSUM_<alg>_VERSION
* PG_DATA_CHECKSUM_VERSION_<n>

To further reduce this naming confusing I have done the following:

s/LocalDataChecksumVersion/LocalDataChecksumState/
s/ChecksumType/ChecksumStateType/
s/xl_checksum_state->new_checksumtype/xl_checksum_state->new_checksum_state/
s/Checkpoint->dataChecksumVersion/Checkpoint->dataChecksumState/

>> @@ -831,9 +852,10 @@ XLogInsertRecord(XLogRecData *rdata,
>> * only happen just after a checkpoint, so it's better to be slow in
>> * this case and fast otherwise.
>> *
>> - * Also check to see if fullPageWrites was just turned on or there's a
>> - * running backup (which forces full-page writes); if we weren't
>> - * already doing full-page writes then go back and recompute.
>> + * Also check to see if fullPageWrites was just turned on, there's a
>> + * running backup or if checksums are enabled (all of which forces
>> + * full-page writes); if we weren't already doing full-page writes
>> + * then go back and recompute.
>> *
>> * If we aren't doing full-page writes then RedoRecPtr doesn't
>> * actually affect the contents of the XLOG record, so we'll update
>
> It's true that if checksums are enabled, we must do full-page writes. But that's already included in the "fullpageWrites" flag. We're not specifically checking whether checksums are enabled here, so I find this comment change more confusing than helpful.

Fair enough, removed.

>> void
>> SetDataChecksumsOnInProgress(void)
>> {
>> uint64 barrier;
>> Assert(ControlFile != NULL);
>> /*
>> * The state transition is performed in a critical section with
>> * checkpoints held off to provide crash safety.
>> */
>> START_CRIT_SECTION();
>> MyProc->delayChkptFlags |= DELAY_CHKPT_START;
>> XLogChecksums(PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
>> SpinLockAcquire(&XLogCtl->info_lck);
>> XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION;
>> SpinLockRelease(&XLogCtl->info_lck);
>> barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_ON);
>> MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
>> END_CRIT_SECTION();
>> /*
>> * Update the controlfile before waiting since if we have an immediate
>> * shutdown while waiting we want to come back up with checksums enabled.
>> */
>> LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
>> ControlFile->data_checksum_version = PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION;
>> UpdateControlFile();
>> LWLockRelease(ControlFileLock);
>> /*
>> * Await state change in all backends to ensure that all backends are in
>> * "inprogress-on". Once done we know that all backends are writing data
>> * checksums.
>> */
>> WaitForProcSignalBarrier(barrier);
>> }
>
> If a checkpoint starts, finishes, and you then crash, all between the END_CRIT_SECTION() and LWLockAcquire here, are you left with with wrong 'data_checksum_version' in the control file? (Highly theoretical, the window is minuscule, but still)

Thats true. In that case the REDO record should have the right value though so
during replay it will be handled.

>> +/*
>> + * InitLocalControlData
>> + *
>> + * Set up backend local caches of controldata variables which may change at
>> + * any point during runtime and thus require special cased locking. So far
>> + * this only applies to data_checksum_version, but it's intended to be general
>> + * purpose enough to handle future cases.
>> + */
>> +void
>> +InitLocalDataChecksumVersion(void)
>> +{
>> + SpinLockAcquire(&XLogCtl->info_lck);
>> + SetLocalDataChecksumVersion(XLogCtl->data_checksum_version);
>> + SpinLockRelease(&XLogCtl->info_lck);
>> +}
>
> Function name in the comment doesn't match.

Fixed.

>> + ereport(WARNING,
>> + errmsg("data checksums state has been set to off"),
>> + errhint("If checksums were being enabled during shutdown then processing must be manually restarted."));
>
> This is a little unclear. If we reach this, we know that the checkum calculation was interrupted, no need to caveat it with "if checksums were being enabled", right? Maybe something like:
>
> WARNING: enabling data checksums was interrupted
> HINT: The checksum processing must be manually restarted
>
> I feel we're missing a term for the process of enabling checkums. "checksum enablement"? "initial checksum calculation"? In some comments it's called "checksumming".

I'm not a fan of "checksumming" so I've gone over all occurrences and reworded
them.

>> * The DataChecksumsWorker will compile a list of databases which exist at the
>> * start of checksumming, and once all are processed will regenerate the list
>> * and start over processing any new entries. Once there are no new entries on
>> * the list, processing will end. All databases MUST BE successfully processed
>> * in order for data checksums to be enabled, the only exception are databases
>> * which are dropped before having been processed.
>
> Why does the list need to be regenerated? Any newly created databases would already have checksums enabled, right? Is this because you might use database that hadn't been fully processed yet as the template? Would be good to mention if that's the reason.

Correct, we cannot rely on the initial list being the full list. Comment
expanded.

> Could CREATE DATABASE calculate the checksums when it copies the files?

I guess it could, though I think that should be a separate patch as this is
already (too) big. It's a good idea though, added to the "ideas for future
optimizations" section in the documentation comment.

>> * means that connected backends will change state at different times. If
>> * waiting for a barrier is done during startup, for example during replay, it
>> * is important to realize that any locks held by the startup process might
>> * cause deadlocks if backends end up waiting for those locks while startup
>> * is waiting for a procsignalbarrier.
>
> Don't we absorb procsignalbarriers while waiting on a lock? Or does this mean LWlocks?

It is referring to LWlocks.

>> * Backends transition Bd -> Bi via a procsignalbarrier which is emitted by the
>> * DataChecksumsLauncher. When all backends have acknowledged the barrier then
>> * Bd will be empty and the next phase can begin: calculating and writing data
>> * checksums with DataChecksumsWorkers. When the DataChecksumsWorker processes
>> * have finished writing checksums on all pages, data checksums are enabled
>> * cluster-wide via another procsignalbarrier. There are four sets of backends
>> * where Bd shall be an empty set:
>> *
>> * Bg: Backend updating the global state and emitting the procsignalbarrier
>> * Bd: Backends in "off" state
>> * Be: Backends in "on" state
>> * Bi: Backends in "inprogress-on" state
>> *
>> * Backends in Bi and Be will write checksums when modifying a page, but only
>> * backends in Be will verify the checksum during reading. The Bg backend is
>> * blocked waiting for all backends in Bi to process interrupts and move to
>> * Be. Any backend starting while Bg is waiting on the procsignalbarrier will
>> * observe the global state being "on" and will thus automatically belong to
>> * Be. Checksums are enabled cluster-wide when Bi is an empty set. Bi and Be
>> * are compatible sets while still operating based on their local state as
>> * both write data checksums.
>
> Isn't "Bg" always the data checksums launcher process? In the previous paragraph before this, you didn't list the data checksums launcher as a separate "set".

Good catch, added.

>> /*
>> * Configuration of conditions which must match when absorbing a procsignal
>> * barrier during data checksum enable/disable operations. A single function
>> * is used for absorbing all barriers, and the set of conditions to use is
>> * looked up in the checksum_barriers struct. The struct member for the target
>> * state defines which state the backend must currently be in, and which it
>> * must not be in.
>> *
>> * The reason for this explicit checking is to ensure that processing cannot
>> * be started such that it breaks the assumptions of the state machine. See
>> * datachecksumsworker.c for a lengthy discussion on these states.
>
> This is "datachecksumworker.c", aka "datachecksum_state.c", so this "See datachecksumworker.c" refers to itself.

Ugh, I missed fixing this when I recently moved code out of xlog.c. Fixed.

>> *
>> * MAX_BARRIER_CONDITIONS must match largest number of sets in barrier_eq and
>> * barrier_ne in the below checksum_barriers definition.
>> */
>> #define MAX_BARRIER_CONDITIONS 2
>> typedef struct ChecksumBarrierCondition
>> {
>> /* The target state of the barrier */
>> int target;
>> /* A set of states in which at least one MUST match the current state */
>> int barrier_eq[MAX_BARRIER_CONDITIONS];
>> /* The number of elements in the barrier_eq set */
>> int barrier_eq_sz;
>> /* A set of states which all MUST NOT match the current state */
>> int barrier_ne[MAX_BARRIER_CONDITIONS];
>> /* The number of elements in the barrier_ne set */
>> int barrier_ne_sz;
>> } ChecksumBarrierCondition;
>
> I don't understand the logic here. Why do you need both 'barrier_eq' and 'barrier_ne'? Wouldn't it be enough to just list the allowed states, and everything else is not allowed?

It was in an attempt to make it more readable, to avoid long lists of states
which look quite similar. It could be handled with just a set of required
allowed states if that's what is preferred, I don't have strong opinions myself
really.

>> static const ChecksumBarrierCondition checksum_barriers[4] =
>> {
>> {PG_DATA_CHECKSUM_OFF, {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION, PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION}, 2, {PG_DATA_CHECKSUM_VERSION}, 1},
>> {PG_DATA_CHECKSUM_VERSION, {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION}, 1, {0}, 0},
>> {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION, {0}, 0, {PG_DATA_CHECKSUM_VERSION}, 1},
>> {PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION, {PG_DATA_CHECKSUM_VERSION}, 1, {0}, 0},
>> };
>
> This is pretty jarring to read. Maybe use C99 designated initializers, or at least split over multiple lines. Or a plain array of of allowed transitions, (current, target) -> bool. Or a function with switch-case statement.

I moved to designated initializers which I think made quite the improvement.
This code was at first in one function per barrier, with the states encoded in
if/then/else, later it was merged into a single function. The logic was
however getting so convoluted that several subtle bugs were found there, after
moving to this configuration-driven approach it has been more reliable and (IMHO)
easier to review.

>> /*
>> * If the postmaster crashed we cannot end up with a processed database so
>> * we have no alternative other than exiting. When enabling checksums we
>> * won't at this time have changed the pg_control version to enabled so
>> * when the cluster comes back up processing will have to be restarted.
>> * When disabling, the pg_control version will be set to off before this
>> * so when the cluster comes up checksums will be off as expected.
>> */
>> if (status == BGWH_POSTMASTER_DIED)
>> ereport(FATAL,
>> errcode(ERRCODE_ADMIN_SHUTDOWN),
>> errmsg("cannot enable data checksums without the postmaster process"),
>> errhint("Restart the database and restart data checksum processing by calling pg_enable_data_checksums()."));
>
> Do we reach here when disabling checksums? The comment suggests so, but in that case the hint would be wrong.

The comment is wrong, we only process databases when enabling checksums so
during disabling we cannot reach here. Fixed.

>> /*
>> * launcher_exit
>> *
>> * Internal routine for cleaning up state when the launcher process exits. We
>> * need to clean up the abort flag to ensure that processing started again if
>> * it was previously aborted (note: started again, *not* restarted from where
>> * it left off).
>> */
>> static void
>> launcher_exit(int code, Datum arg)
>> {
>> if (launcher_running)
>> {
>> LWLockAcquire(DataChecksumsWorkerLock, LW_EXCLUSIVE);
>> launcher_running = false;
>> DataChecksumsWorkerShmem->launcher_running = false;
>> /*
>> * TODO: how to really handle the worker still running when the
>> * launcher exits?
>> */
>> if (DataChecksumsWorkerShmem->worker_running)
>> ereport(LOG,
>> errmsg("data checksums launcher exiting while worker is still running"));
>> LWLockRelease(DataChecksumsWorkerLock);
>> }
>> }
>
> What "abort flag"?

There is an abort_requested flag which is set by the signal handler, but I'm
not sure why the clearing of the flag wasn't done here as the comment states it
should be. Fixed.

>> /*
>> * Get a list of all databases to process. This may include databases that
>> * were created during our runtime. Since a database can be created as a
>> * copy of any other database (which may not have existed in our last
>> * run), we have to repeat this loop until no new databases show up in the
>> * list. Here the initial list for the loop processing is generated after
>> * waiting for all existing transactions to finish to ensure that we can
>> * see any database which was created even if the transaction in which it
>> * was created started before checksums were being processed.
>> */
>> WaitForAllTransactionsToFinish();
>> DatabaseList = BuildDatabaseList();
>
> What if a database is dropped and another one is created with the same database OID?

Good point. We have the name of the database so we could compare Oid and name,
and require both to match. That would still be foiled by a new database using
the same Oid and name though. Any ideas on what could be done?

>> /*
>> * This is the only place where we check if we are asked to abort, the
>> * abortion will bubble up from here. It's safe to check this without
>> * a lock, because if we miss it being set, we will try again soon.
>> */
>> Assert(operation == ENABLE_DATACHECKSUMS);
>> LWLockAcquire(DataChecksumsWorkerLock, LW_SHARED);
>> if (DataChecksumsWorkerShmem->launch_operation == DISABLE_DATACHECKSUMS)
>> abort_requested = true;
>> LWLockRelease(DataChecksumsWorkerLock);
>
> The comment justifies that it's safe to do this without a lock, but it grabs the lock anyway.

I have a feeling that's an old comment that was missed when updating, we should
grab a lock here. Fixed.

>> @@ -107,7 +107,8 @@ PageIsVerified(PageData *page, BlockNumber blkno, int flags, bool *checksum_fail
>> */
>> if (!PageIsNew(page))
>> {
>> - if (DataChecksumsEnabled())
>> + HOLD_INTERRUPTS();
>> + if (DataChecksumsNeedVerify())
>> {
>> checksum = pg_checksum_page(page, blkno);
>> @@ -118,6 +119,7 @@ PageIsVerified(PageData *page, BlockNumber blkno, int flags, bool *checksum_fail
>> *checksum_failure_p = true;
>> }
>> }
>> + RESUME_INTERRUPTS();
>> /*
>> * The following checks don't prove the header is correct, only that
>
> This is to ensure that we don't absorb a barrier while calculating the checksum, which could transition us from "on" to "off" ? This is probably not really needed because there are no CHECK_FOR_INTERRUPTS() calls in pg_checksum_page(), but better safe than sorry I guess. Another approach would be to check DataChecksumsNeedVerify() again if the checksum didn't match. In any case, a comment would be in order.

Right, it's a belts-and-suspenders type of safety measure. Comment added.

> s/datachecksumsworker.c/datachecksum_state.c/

Fixed

> DataChecksumsOnInProgress() and DataChecksumsOffInProgress() are unused.

Fixed

> Attached are a few more trivial fixes, in patch format.

Thanks, applied.

The above changes are kept in 0003 in order to make them easier to see in this
somewhat large patchset.

--
Daniel Gustafsson

Attachment Content-Type Size
v20260317-0001-Add-proper-WAL-record-for-XLOG_CHECKPOINT_.patch application/octet-stream 3.9 KB
v20260317-0002-Online-enabling-and-disabling-of-data-chec.patch application/octet-stream 232.6 KB
v20260317-0003-review_comments.patch application/octet-stream 41.3 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message KAZAR Ayoub 2026-03-17 23:02:28 Re: Speed up COPY TO text/CSV parsing using SIMD
Previous Message Robert Haas 2026-03-17 21:45:52 Re: pg_plan_advice