Re: Changing the state of data checksums in a running cluster

From: Daniel Gustafsson <daniel(at)yesql(dot)se>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Bernd Helmle <mailings(at)oopsware(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Changing the state of data checksums in a running cluster
Date: 2026-02-03 16:58:22
Message-ID: 2535E72D-5979-4F58-A2D3-2AA4AA335C20@yesql.se
Lists: pgsql-hackers

> On 15 Dec 2025, at 23:21, Andres Freund <andres(at)anarazel(dot)de> wrote:

> My suggestion for how to do this instead is to put the checksum state into the
> XLOG_CHECKPOINT_* records. When starting recovery from an online checkpoint,
> I think we should use the ChecksumType from the XLOG_CHECKPOINT_REDO record,
> that way the standby/recovery environment's assumption about whether checksums
> were enabled is the same as it was at the time the WAL was generated. For
> shutdown checkpoints, we could either start to emit a XLOG_CHECKPOINT_REDO, or
> we can use the information from the checkpoint record itself.
>
> I think if we only ever use the checksum state from the point where we start
> recovery, we might not need to force *any* checkpoints.

I've implemented this in the attached patch and it seems very promising. There
are no forced checkpoints left at all and it (mostly) just works. There are
still bugs left which cause page verification failures in the 006/007 tests,
but I hope that your or Tomas' eagle eyes can help me spot them since I failed
to (this is far from my area of expertise, so I'd appreciate pointers for
learning). This might be due to my implementation being incomplete, but I felt
it was time to show at least an intermediate step along the way.

> Daniel and I chatted about that proposal, and couldn't immediately come up
> with scenarios where that would be wrong. For a while I thought there would
> be problems when doing PITR from a base backup that had checksums initially
> enabled, but where checksums were disabled before the base backup was
> completed. My worry was that a later (once checksums were disabled) hint bit
> write (which would not necessarily be WAL logged) would corrupt the checksum,
> but I don't think that's a problem, because the startup process will only read
> data pages in the process of processing WAL records, and if there's a WAL
> record, there would also have to be an FPW, which would "cure" the
> unchecksummed page.
>
> More comments below, inline - I wrote these first, so it's possible that I
> missed updating some of them in light of what I now wrote above.

Some were completely invalidated by the new approach, and I've removed them
since the code they referred to was removed as well.

>> +static const ChecksumBarrierCondition checksum_barriers[] =
>> +{
>> + {PG_DATA_CHECKSUM_OFF, {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION, PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION}, 2, {PG_DATA_CHECKSUM_VERSION}, 1},
>> + {PG_DATA_CHECKSUM_VERSION, {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION}, 1, {0}, 0},
>> + {PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION, {PG_DATA_CHECKSUM_ANY_VERSION}, 1, {PG_DATA_CHECKSUM_VERSION}, 1},
>> + {PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION, {PG_DATA_CHECKSUM_VERSION}, 1, {0}, 0},
>> + {-1}
>> +};
>
> The explanation for this doesn't really explain what the purpose of this thing
> is... Perhaps worth referencing datachecksumsworker.c or such?

Done. It might need more justification, but at least there is an attempt. I've
tried not to spread the documentation across too many places to keep it from
drifting apart, so I added a reference to datachecksumsworker.c.

> For a local, constantly sized, array, you shouldn't need a -1 terminator, as
> you can instead use lengthof() or such to detect invalid accesses.

Fixed.

>> +/*
>> + * Flag to remember if the procsignalbarrier being absorbed for checksums is
>> + * the first one. The first procsignalbarrier can in rare cases be for the
>> + * state we've initialized, i.e. a duplicate. This may happen for any
>> + * data_checksum_version value when the process is spawned between the update
>> + * of XLogCtl->data_checksum_version and the barrier being emitted. This can
>> + * only happen on the very first barrier so mark that with this flag.
>> + */
>> +static bool InitialDataChecksumTransition = true;
>
> This is pretty hard to understand right now, at the very least it needs an
> updated comment. But perhaps we can just get rid of this and accept barriers
> that are redundant.

I think you are right, removed.

>> - doPageWrites = (Insert->fullPageWrites || Insert->runningBackups > 0);
>> + doPageWrites = (Insert->fullPageWrites ||
>> + Insert->runningBackups > 0 ||
>> + DataChecksumsNeedWrite());
>>
>> if (doPageWrites &&
>> (!prevDoPageWrites ||
>
> Why do we need to separately check for DataChecksumsNeedWrite() if turning on
> checksums also forces Insert->fullPageWrites to on?

Fair point, removed.

>> + MyProc->delayChkptFlags |= DELAY_CHKPT_START;
>> + START_CRIT_SECTION();
>
> ISTM that delayChkptFlags ought to only be set once within the critical
> section. Obviously we can't fail inbetween as the code stands, but that's not
> guaranteed to stay this way.

Agreed, that was incorrect, fixed.

>> + XLogChecksums(PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
>> +
>> + SpinLockAcquire(&XLogCtl->info_lck);
>> + XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION;
>> + SpinLockRelease(&XLogCtl->info_lck);
>
> Maybe worth adding an assertion checking that we are currently in an expected
> state (off or inprogress, I think?).

Allowing the inprogress state would be harmless, but I think we should make
sure we are only ever in the "off" state if we get as far as emitting a
procsignalbarrier.
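
Something along these lines is what I have in mind (rough sketch, not the
exact hunk from the patch):

    SpinLockAcquire(&XLogCtl->info_lck);
    Assert(XLogCtl->data_checksum_version == 0);    /* i.e. "off" */
    XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION;
    SpinLockRelease(&XLogCtl->info_lck);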

> Think it might be worth mentioning that we rely on the memory ordering implied
> by XLogChecksums() and WaitForProcSignalBarrier() for the changes to
> delayChkptFlags. Unless we have a different logic around that?

No other bespoke logic. Where do you propose calling it out so it doesn't get
buried?

>> + /*
>> + * The only allowed state transition to "on" is from "inprogress-on" since
>> + * that state ensures that all pages will have data checksums written.
>> + */
>> + if (XLogCtl->data_checksum_version != PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION)
>> + {
>> + SpinLockRelease(&XLogCtl->info_lck);
>> + elog(ERROR, "checksums not in \"inprogress-on\" mode");
>
> Seems like a PANIC condition to me...

Fair point. Fixed.

>> + }
>> +
>> + SpinLockRelease(&XLogCtl->info_lck);
>> +
>> + MyProc->delayChkptFlags |= DELAY_CHKPT_START;
>> + INJECTION_POINT("datachecksums-enable-checksums-delay", NULL);
>> + START_CRIT_SECTION();
>
> I think it's a really really bad idea to do something fallible, like
> INJECTION_POINT, after setting delayChkptFlags, but before entering the crit
> section. Any error in the injection point will lead to a corrupted
> delayChkptFlags, no?

Agreed, reordered.
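
The ordering now ends up roughly as below (sketch, not the exact patch hunk),
with the fallible injection point before we touch delayChkptFlags and the flag
only set once inside the critical section:

    INJECTION_POINT("datachecksums-enable-checksums-delay", NULL);

    START_CRIT_SECTION();
    MyProc->delayChkptFlags |= DELAY_CHKPT_START;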

>> + XLogChecksums(0);
>
> Why no symbolic name here?

For a long time this patch didn't add a name for the OFF case since the current
code only has a version constant for ON and not OFF. When the OFF name was
added I missed updating all the places where it should be used. Fixed.
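
So the callsite now reads as below, using the same "off" constant as in the
barrier table quoted above:

    XLogChecksums(PG_DATA_CHECKSUM_OFF);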

>> + while (condition->target != target_state && condition->target != -1)
>> + condition++;
>> + if (unlikely(condition->target == -1))
>> + ereport(ERROR,
>> + errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>> + errmsg("invalid target state %i for data checksum barrier",
>> + target_state));
>
> FWIW, you don't need unlikely() when the branch does an ereport(ERROR), as
> ereports >=ERROR are marked "cold" automatically.

Nice, I didn't know that. unlikely() removed. Thinking more about it, this
whole conditional should probably be an assertion instead.

Also switched to iterating over the array with lengthof(), removing the -1
sentinel, as mentioned in another review comment.
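
The lookup now ends up along these lines (sketch; per the above, the error
will likely become an assertion):

    const ChecksumBarrierCondition *condition = NULL;

    for (int i = 0; i < lengthof(checksum_barriers); i++)
    {
        if (checksum_barriers[i].target == target_state)
        {
            condition = &checksum_barriers[i];
            break;
        }
    }
    Assert(condition != NULL);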

>> + default:
>> + Assert(state.new_checksumtype == 0);
>> + barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_OFF);
>> + WaitForProcSignalBarrier(barrier);
>> + break;
>
> I'd not add a default clause, but add a case each value of the enum. That way
> we'll get warnings if the set of states changes.

This wasn't using the enum which was why it had a default, but I totally agree
and have switched that over now.
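
The barrier emission ends up roughly like this (sketch; the exact set of
states and names is per the patch):

    switch (state.new_checksumtype)
    {
        case PG_DATA_CHECKSUM_VERSION:
            barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_ON);
            break;
        case PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION:
            barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_ON);
            break;
        case PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION:
            barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_OFF);
            break;
        case PG_DATA_CHECKSUM_OFF:
            barrier = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_OFF);
            break;
    }
    WaitForProcSignalBarrier(barrier);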

> These WaitForProcSignalBarrier() are one of the scariest bits of the
> patchset. If the startup process were to hold any lock that backends need, and
> the backends waited for that lock without processing interrupts, we'd have an
> undetected deadlock. This is much more likely to be a problem for the startup
> process, as it does the work of many backends on the primary.
>
> We do process interrupts while waiting for heavyweight locks, so that at least
> is not an issue. Seems worth to call out rather explicitly.

I've added a note about it, not entirely sure what else to do really.

>> - if (!noverify_checksums && DataChecksumsEnabled() &&
>> + if (!noverify_checksums &&
>> + DataChecksumsNeedWrite() &&
>> RelFileNumberIsValid(relfilenumber))
>> verify_checksum = true;
>
> Why is DataChecksumsNeedWrite() being tested here?

That seems incorrect, it should check NeedVerify() to test whether checksums
are enabled. I've also refactored this to do finer grained checking rather
than a single check in sendFile, which is only sufficient when the state
cannot change.

>> + * have checksums set, but no reads will fail due to incorrect checksum.
>
> Maybe "... due to not yet set checksums."? Incorrect checksums sounds like
> it's about checksums that are actively wrong, rather than just not set. Except
> for the corner case of a torn page in the process of an hint bit write, after
> having disabled checksums, there shouldn't be incorrect ones, right?

Correct, I meant invalid checksums and not incorrect checksums. Fixed, and
also added a sentence discussing why invalid checksums can be present in the
first place.

>> The
>> + * DataChecksumsWorker will compile a list of databases which exist at the
>> + * start of checksumming, and all of these which haven't been dropped during
>> + * the processing MUST have been processed successfully in order for checksums
>> + * to be enabled. Any new relation created during processing will see the
>> + * in-progress state and will automatically be checksummed.
>
> What about new databases created while checksums are being enabled? They could
> be copied before the worker has processed them. At least for the file_copy
> strategy, the copy will be verbatim and thus will not necessarily have
> checksums set.

This is a pretty terrible sentence, mixing two concepts together and making it
unreadable.

Enabling checksums is deliberately naive in order to avoid the very problem you
refer to. A list of databases is generated and processed. When done, the
worker starts over, generates a new list, and processes each new entry,
skipping the ones already seen, until no new entries appear in the list.
Skipping databases created after processing started, since we can be certain
those already have checksums set, is a TODO optimization.

I have reworded that sentence and also added it to the Optimizations section of
the header comment.
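
To spell the loop out in code form (the helper names here are made up for
illustration, not the actual functions in the patch):

    List       *processed = NIL;
    bool        found_new;

    do
    {
        /* take a fresh snapshot of pg_database on every pass */
        List       *dbs = BuildDatabaseList();
        ListCell   *lc;

        found_new = false;
        foreach(lc, dbs)
        {
            Oid         dboid = lfirst_oid(lc);

            if (list_member_oid(processed, dboid))
                continue;       /* already checksummed in an earlier pass */

            /* checksum every relation in this database */
            ProcessDatabase(dboid);
            processed = lappend_oid(processed, dboid);
            found_new = true;
        }
    } while (found_new);        /* stop once a pass finds nothing new */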

>> + * There are two levels of synchronization required for changing data_checksums
>
> Maybe s/levels/steps/?

Yeah, much better. Maybe 'stages' is even better than that though?

>> + * Synchronizing the state change is done with procsignal barriers, where the
>> + * WAL logging backend updating the global state in the controlfile will wait
>
> It's not entirely obvious what "the WAL logging backend" means.

Even I am not entirely sure what I meant, reworded.

>> + * interrogating state and resumed when the IO operation has been performed.
>> + *
>> + * When Enabling Data Checksums
>> + * ----------------------------
>
> Odd change in indentation here.

It was intended to make this a subsection under "Synchronization and
Correctness" which is the main section. Doing that with actual section
numbering will be a lot clearer though. Done that way. I also tried to make
the sections a bit less dense by adding some vertical whitespace.

>> + * Bd: Backends in "off" state
>> + * Bi: Backends in "inprogress-on" state
>> + *
>> + * If processing is started in an online cluster then all backends are in Bd.
>> + * If processing was halted by the cluster shutting down, the controlfile
>> + * state "inprogress-on" will be observed on system startup and all backends
>> + * will be placed in Bd.
>
> Why not in Bi? Just for simplicities sake? ISTM we already need to be sure
> that new backends start in Bi, as they might never observe the barrier...

When we start up after a crash and find that the processing didn't finish, the
state is reverted to "off"; we don't know whether the user wants to restart
processing at that point. What this comment fails to mention is that the
controlfile is also reverted to "off".

>> + * all pages and enables data checksums cluster-wide via another
>
> s/enables/enabled/?

Fixed in the rewrite.

>> + * * Invent a lightweight WAL record that doesn't contain the full-page
>> + * image but just the block number: On replay, the redo routine would read
>> + * the page from disk.
>
> The last sentence might be truncated?

This TODO item is a verbatim quote from Heikki which was brought up as a review
comment in b832a496-f0aa-c7e4-cca4-6bd0af4d4a90(at)iki(dot)fi. Re-reading this
I am not entirely sure what it would imply, so I opted to remove it instead.

>> + /*
>> + * As of now we only update the block counter for main forks in order to
>> + * not cause too frequent calls. TODO: investigate whether we should do it
>> + * more frequent?
>> + */
>> + if (forkNum == MAIN_FORKNUM)
>> + pgstat_progress_update_param(PROGRESS_DATACHECKSUMS_BLOCKS_TOTAL,
>> + numblocks);
>
> That doesn't make much sense to me. Presumably the reason to skip it for the
> other forks is that they're small-ish. But if so, there's no point in skipping
> the reporting either, as presumably there wouldn't be a lot of reporting?

Fair point, removed.

>> +static void
>> +launcher_exit(int code, Datum arg)
>> +{
>> + if (launcher_running)
>> + {
>> + LWLockAcquire(DataChecksumsWorkerLock, LW_EXCLUSIVE);
>> + launcher_running = false;
>> + DataChecksumsWorkerShmem->launcher_running = false;
>> + LWLockRelease(DataChecksumsWorkerLock);
>> + }
>> +}
>
> Could we end up exiting this with the worker still running?

Not during regular operation AFAICT, but it might still be possible. For now
I've added logging of it and a TODO comment.
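
Roughly like below; the worker_running field name is hypothetical here:

    static void
    launcher_exit(int code, Datum arg)
    {
        if (launcher_running)
        {
            LWLockAcquire(DataChecksumsWorkerLock, LW_EXCLUSIVE);
            /* TODO: can we get here with the worker still running? */
            if (DataChecksumsWorkerShmem->worker_running)
                elog(LOG, "data checksums worker still running at launcher exit");
            launcher_running = false;
            DataChecksumsWorkerShmem->launcher_running = false;
            LWLockRelease(DataChecksumsWorkerLock);
        }
    }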

>> +/*
>> + * WaitForAllTransactionsToFinish
>
> I think either here, or at its callsites, the patch needs to explain *why* we
> are waiting for all transactions to finish. Presumably this is to ensure that
> other sessions haven't created relations that we can't see yet?

Correct. I've updated the comments.

> It actually doesn't seem to wait for all transactions, just for ones with an
> xid?

Yes, is that not enough for this purpose or am I missing something?

>> + /* Retry every 3 seconds */
>> + ResetLatch(MyLatch);
>> + rc = WaitLatch(MyLatch,
>> + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
>> + 3000,
>> + WAIT_EVENT_CHECKSUM_ENABLE_STARTCONDITION);
>> +
>> + /*
>> + * If the postmaster died we won't be able to enable checksums
>> + * cluster-wide so abort and hope to continue when restarted.
>> + */
>> + if (rc & WL_POSTMASTER_DEATH)
>> + ereport(FATAL,
>> + errcode(ERRCODE_ADMIN_SHUTDOWN),
>> + errmsg("postmaster exited during data checksum processing"),
>> + errhint("Restart the database and restart data checksum processing by calling pg_enable_data_checksums()."));
>> +
>> + LWLockAcquire(DataChecksumsWorkerLock, LW_SHARED);
>> + if (DataChecksumsWorkerShmem->launch_operation != operation)
>> + abort_requested = true;
>> + LWLockRelease(DataChecksumsWorkerLock);
>> + if (abort_requested)
>> + break;
>
> I don't like this much - loops with a timeout are generally a really bad idea
> and we shouldn't add more instances. Presumably this also makes the tests
> slower...
>
> How about collecting the to-be-waited-for virtualxids and then wait for those?

This seemed the safer option; how does one get the set of virtualxids to wait for?
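
Is it something along the lines of what CREATE INDEX CONCURRENTLY does, i.e.
(untested sketch):

    VirtualTransactionId *vxids;
    int         nvxids;

    /* collect everyone currently running, then wait for each to finish */
    vxids = GetCurrentVirtualXIDs(InvalidTransactionId, true, true,
                                  0, &nvxids);
    for (int i = 0; i < nvxids; i++)
        VirtualXactLock(vxids[i], true);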

>> + RESUME_INTERRUPTS();
>
> I don't understand what this interrupt stuff achieves here?

Removed.

>> -#define NUM_AUXILIARY_PROCS (6 + MAX_IO_WORKERS)
>> -
>> +#define NUM_AUXILIARY_PROCS (8 + MAX_IO_WORKERS)
>
> Aren't they bgworkers now?

Correct, this is a very old leftover from when the ChecksumHelper was based on
autovacuum and wasn't a bgworker. Hunk reverted.

>> + PROCSIGNAL_BARRIER_CHECKSUM_OFF,
>> + PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_ON,
>> + PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_OFF,
>> + PROCSIGNAL_BARRIER_CHECKSUM_ON,
>> } ProcSignalBarrierType;
>
> I wonder if these really should be different barriers. What if we just made it
> one, and instead drove the transition on the current shmem content?

I sort of like having the barrier types encode the state, but I don't have
super strong feelings on that. If you think it's something we should do then I
can hack that up.

> Other stuff:
> - what protects against multiple backends enabling checksums at the same time?
>
> Afaict there isn't anything, and we just ignore the second request. Which
> seems ok-ish if it's the same request as before, but not great if it's a
> different one.

If a different request comes in it will be recorded as a new launch_operation,
and after each block the processing double-checks that the current operation
still matches the launch_operation. If the datachecksumsworker is enabling
checksums and the launch_operation has been set to disable, checksums will be
disabled and processing ends. I've added a short note to the documentation
about this.

> Should also have tests for that.

I have added some rudimentary tests covering this, but it's surprisingly hard
to make stable tests without putting all hope on synchronization around delays.
If this gets committed I doubt they should be part of it, but for now they
illustrate the point at least.

> - I think this adds a bit too much of the logic to xlog.c, already an unwieldy
> file. A fair bit of all of this doesn't seem like it needs to be in there.

No disagreement, but thinking about it I struggled to figure out a good place
to put it. Do you have any suggestions?

> - the code seems somewhat split brained about bgworkers and auxprocesses

Is it, or was it mainly the docs? I've tried to rectify what I saw but I might
be misunderstanding your point here.

The attached .txt contains the fixes done above before they were squashed into
the patch, to assist reviewing in case anyone wants to see the diff.

--
Daniel Gustafsson

Attachment Content-Type Size
v20250203-2-0001-Online-enabling-and-disabling-of-data-ch.patch application/octet-stream 222.2 KB
v20250203-2-Review-fixes.txt text/plain 100.2 KB
