Quick Links

Re: Changing the state of data checksums in a running cluster

From:	Tomas Vondra <tomas(at)vondra(dot)me>
To:	Daniel Gustafsson <daniel(at)yesql(dot)se>, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>
Cc:	Ayush Tiwari <ayushtiwari(dot)slg01(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Andres Freund <andres(at)anarazel(dot)de>, Bernd Helmle <mailings(at)oopsware(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>
Subject:	Re: Changing the state of data checksums in a running cluster
Date:	2026-05-26 18:12:35
Message-ID:	f1281cf3-89a3-4936-9bc5-2a5a6291229f@vondra.me
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On 5/4/26 15:16, Tomas Vondra wrote:
> Hi,
>
> ... snip ...
>
> Let me briefly explain what the various TAP tests aim to do. From the
> very beginning, my main concern regarding this patch was race conditions
> when updating the shared state about effective data_checksum_version.
> Because the state is effectively split into about three or four places:
>
> * LocalDataChecksumVersion (local cache)
> * XLogCtl->data_checksum_version (XLogCtl->info_lck)
> * ControlFile->data_checksum_version (ControlFileLock)
> * state in control file on disk
>
> These pieces are protected by different locks, the protocol for updating
> and/or reading the various flags is not trivial (and some of the fixed
> issues were due to ControlFile->data_checksum_version being updated from
> a place that shouldn't have).
>

I still have a funny feeling about the "global checksum version" stored
in multiple places with separate locks, and the protocol when updating
them (e.g. which process does that, when, etc.). I haven't found any
fundamental correctness issue, though.

I kept looking for issues, and I realized StartupXLOG may not do the
right thing on the standby after promotion.

if (XLogCtl->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_ON)
{
XLogChecksums(PG_DATA_CHECKSUM_OFF);

SpinLockAcquire(&XLogCtl->info_lck);
XLogCtl->data_checksum_version = PG_DATA_CHECKSUM_OFF;
SetLocalDataChecksumState(XLogCtl->data_checksum_version);
SpinLockRelease(&XLogCtl->info_lck);

ereport(...);
}

Consider this sequence of steps:

1) primary/standby cluster has checksums OFF
2) primary starts enabling checksums
3) primary moves to inprogress-on
4) standby receives that and moves to inprogress-on too
5) primary crashes / shuts down / ...
6) standby gets promoted, and does the StartupXLOG thing
7) standby moves from inprogress-on back to off

However, as there's no EmitProcSignalBarrier(), existing backends on the
hot standby won't be notified about the "off" change. And so there will
be a somewhat inconsistent checksum state, with new backends knowing
it's "off", and old backends thinking it's still inprogress-on.

It's not difficult to reproduce this. I guess the existing TAP tests may
not catch this because they always open a new connection when checking
the state, and so see the new (and correct) version.

I believe a similar issue happens for the inprogress-off case a couple
lines later, but I haven't tried reproducing that one.

Ultimately, neither of these cases should cause checksum failures,
because the "correct" new state is "off", and the old backends are not
verifying checksums either.

I suppose this means we should not be updating the checksum state
without emitting the barrier? I think all other places do that.

>
> ... snip ...
>
> The TAP 012 tests checksum change with a concurrent checkpoint, followed
> by a crash, and tests the final state. It pauses the change at an
> injection point, does a checkpoint, proceeds to the next injection
> point, crashes and does recovery. The expectation is that the final
> state "flips" at some injection point, once it gets further enough, and
> stays there. But what actually happens is this:
>
> a) test_checksum_transition(
> 'disabled', 'enable', undef,
> 'datachecksums-enable-inprogress-checksums-end',
> 'datachecksums-enable-checksums-start',
> 'off');
>
> b) test_checksum_transition(
> 'disabled', 'enable', undef,
> 'datachecksums-enable-checksums-start',
> 'datachecksums-enable-checksums-after-xlog',
> 'on');
>
> c) test_checksum_transition(
> 'disabled', 'enable', 'datachecksums-enable-checksums-start',
> 'datachecksums-enable-checksums-after-xlogs',
> 'datachecksums-enable-checksums-after-xlogctl',
> 'off');
>
> This says that if the checkpoint happens after
> 'datachecksums-enable-inprogress-checksums-end' or after
> 'datachecksums-enable-checksums-after-xlog', we end up with 'off' (i.e.
> enabling checksums fails).
>
> But if the checkpoint happens after
> 'datachecksums-enable-checksums-start', we end up with "on" (after
> recovery).
>
> This is a bit surprising, because that injection point is before
> 'datachecksums-enable-checksums-after-xlog'. So the enabling process
> gets further and further, but the final state flips off -> on -> off,
> contradicting the expectation that it changes once.
>
> I haven't quite wrapped my head around it yet, but my understanding is
> this is due to a race condition between the checksum launcher (writing
> XLOG2_CHECKSUMS and updating the shmem state), and the checkpointer
> (reading the shmem state and generating REDO).
>
> The launcher does this sequence of steps:
>
> 1) write XLOG2_CHECKSUMS with new state
> 2) update XLogCtl->data_checksum_version
> 3) update ControlFile->data_checksum_version
> 4) UpdateControlFile()
> 5) emits barrier
>
> while the checkpointer (CreateCheckPoint) does this:
>
> A) read XLogCtl->data_checksum_version (while holding insert locks)
> B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
> C) UpdateControlFile()
>
> The outcome depends on how exactly these two sequences interleave. For
> example, this can happen:
>
> 1) write XLOG2_CHECKSUMS with new state
> A) read XLogCtl->data_checksum_version (while holding insert locks)
> B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
> C) UpdateControlFile()
> 2) update XLogCtl->data_checksum_version
> 3) update ControlFile->data_checksum_version
> 4) UpdateControlFile()
> 5) emits barrier
>
> Which means the XLOG_CHECKPOINT_REDO will be after XLOG2_CHECKSUMS (and
> so redo won't see it), but the checkpoint will still get the old
> checksum state from XLogCtl. And so the outcome is "off", per case (c).
>
> But it can also happen what case (b) does:
>
> A) read XLogCtl->data_checksum_version (while holding insert locks)
> B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
> C) UpdateControlFile()
> 1) write XLOG2_CHECKSUMS with new state
> 2) update XLogCtl->data_checksum_version
> 3) update ControlFile->data_checksum_version
> 4) UpdateControlFile()
> 5) emits barrier
>
> In which case the REDO will have the old state, but the recovery will
> read the XLOG2_CHECKSUMS, and so end up with "on".
>
> This is the root cause of the surprising behavior in TAP 012, I think.
>
> I attempted to trigger these race conditions in TAP 013, but without
> much success. In the end I realized it probably needs more control,
> waiting for the other process to hit the next injection point before
> unpausing the current one. TAP 014 does that, and it shows that with the
> right interleaving of steps the (c) case can end up with both "on" and
> "off" final state.
>
> As I said, I don't claim I fully understand this yet. But I wouldn't
> call this "bug" - AFAICS it won't produce an incorrect final state (I
> haven't seen any such cases).
>
> Still, I wonder if there's a potential issue I failed to notice.
>

I kept thinking about this non-determinism, and I still think I'm right
about the root cause. The checkpointer and datachecksum worker may
interleave in different ways, affecting the final checksum state. The
existing interlock (or lack of of it) is not sufficient.

To verify this hypothesis, I've done a simple thing - I've introduced a
new LWLock, protecting the "critical part" so that the checkpoint_redo
can't happen in between XLogChecksums() and XLogCtl update. Patch
attached, but I don't propose this for commit, I'm not sure if it's
right (or safe), it's merely a PoC to confirm the issue. But it resolves
the weird cases, simply by prohibiting some of the step sequences (so
some of the tests needed to be commented out, as they'd block).

I'm still not sure if it really is an issue or just an annoyance,
because I've not been able to find a case where it'd lead to checksum
failures (or obviously incorrect final state after recovery).

>
> The other question I had when looking at this (concurrency with
> checkpoints) is what we get by doing
>
> MyProc->delayChkptFlags |= DELAY_CHKPT_START;
>
> whenever updating the state in SetDataChecksums... functions. Because
> the only thing that guarantees is the updates happen on one side of the
> checkpoint record. What does that give us, actually?
>
> It does not seem to prevent this surprising behavior, and it does not
> say the XLOG2_CHECKSUMS happens before/after the XLOG_CHECKPOINT_REDO.
>

I still don't understand why this needs DELAY_CHKPT_START ...

I also noticed a couple minor comment issues, per attached patch (this
may need pgindent).

regards

--
Tomas Vondra

Attachment	Content-Type	Size
checksums-extra-lwlock.patch	text/x-patch	24.9 KB
comment-fixes.patch	text/x-patch	6.5 KB

In response to

Re: Changing the state of data checksums in a running cluster at 2026-05-04 13:16:34 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Mohamed ALi	2026-05-26 18:32:26	Re: [PATCH] Add NESTED_STATEMENTS option to EXPLAIN
Previous Message	Heikki Linnakangas	2026-05-26 18:00:11	Re: Avoid orphaned objects dependencies, take 3