Re: Changing the state of data checksums in a running cluster

From: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Tomas Vondra <tomas(at)vondra(dot)me>, Andres Freund <andres(at)anarazel(dot)de>, Bernd Helmle <mailings(at)oopsware(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>
Subject: Re: Changing the state of data checksums in a running cluster
Date: 2026-04-28 20:25:30
Message-ID: CAHg+QDfS0Ckw2HT-TWkdvB1pq9rrFzm1Pk_AUwuFL+Er23ZOFA@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Thu, Apr 23, 2026 at 3:16 AM Daniel Gustafsson <daniel(at)yesql(dot)se> wrote:

> While performing extensive post-commit testing using the built-in TAP tests
> in checksum_extended mode, Tomas observed a couple of issues. The testing
> was done on multiple machines, ranging from a fast x86 machine to a much
> slower rpi4, but all of them performed a very large number of iterations.
> The combination of hardware and the large number of tests executed is
> likely why some of the issues went unnoticed.
>
> Below are the issues described in more detail:
>
> 1) race condition between checkpoint and checksum state changes
>
> The main issue is a race condition causing invalid state transitions.
> When analyzing the failure (which we couldn't reproduce synthetically, only
> observe by running the tests over and over for a long time) we realized
> that the original coding was performing checksum state transitions while
> replaying online checkpoints as well as redo checkpoints.
>
> Updating checksum state while replaying online checkpoints is incorrect.
> After a crash we start with the REDO record, which initializes checksums
> to the right state; later the checksum state is updated by the regular
> XLOG_CHECKSUMS entries.
>
> Moreover, the checksum state in the XLOG_CHECKPOINT_ONLINE record can be
> stale, because the value is determined at checkpoint start while the WAL
> entry is added at the end. If there is a concurrent checksum change, the
> value written to the WAL record will be stale, and replaying it will cause
> an "invalid transition" error later, during the next checksum state update.
> The fix is to remove the checksum update from the online checkpoint
> altogether.
>
> In fact, there's no need for CreateCheckPoint to update checksum state
> in the ControlFile. If the value is stale, it could make it permanent.
> But it's unnecessary - the ControlFile is updated by the process
> performing the checksum state change. So remove that.
>
> This, along with re-ordering the control file update and the procsignal
> barrier emission, fixes the issue. The re-ordering makes sure that the
> control file is always updated *before* a procsignal barrier is emitted,
> to avoid a race like the one described below:
>
> 1. A barrier for off to inprogress-on is emitted
> 2. All active backends absorb the barrier
>    - All processes in the cluster are now in state inprogress-on
> 3. A new backend b' forks, reads the control file and sets its state to off
> 4. The control file is updated
> 5. A new backend b'' forks, reads the control file and sets its state to
>    inprogress-on
>
> b' and b'' have different states, and b' has a state incompatible with the
> rest of the cluster. Re-ordering as done in the attached makes this go
> away.
>
>
> 2) race condition in launcher exit
>
> Another timing-related issue was that reverting to the "off" state when
> the launcher errors out had synchronization logic which was racy, as it
> relied on the cached checksum state rather than the value from XLogCtl.
> The logic for determining if a launcher/worker was already active was also
> fragile, as it started another launcher which would overwrite certain data
> in shared memory. The attached patch inspects shared memory instead and
> uses that to signal the running launcher to either abort (disable) or
> change the cost parameters of a running enable process. These fixes make
> erroring out and going to the off state stable.
>
>
> 3) Concurrency issue with ProcSignalInit / InitLocalDataChecksumState
>
> The checksum barriers must not be consumed before the initial value gets
> properly set. On very slow systems, there could potentially be multiple
> checksum state transitions between a fork and InitLocalDataChecksumState.
> In such cases we might get failures due to incorrect transitions.
>
> With the current code this is not a live issue, as there is no place
> checking for interrupts in between the two functions. But it's fragile,
> as it's trivial to break by adding an elog() somewhere, which is what
> happened to us while debugging the other issues. So better to explicitly
> hold interrupts for a brief moment.
>
> To find the issues, and to validate their fix, Tomas developed a new test
> suite which is attached as a .txt. This is not proposed for adding to v19;
> it is included to showcase what was done, and what will be further hacked
> on for a new suite during the v20 cycle. It is gated behind PG_TEST_EXTRA
> and is intended to be executed by select members of the buildfarm.
>
> As part of this post-commit review we also found a few more cleanups and
> smaller fixes which are included. The patchset also includes the patch
> submitted upthread by Satyanarayana Narlapuram.
>
> More details can be found in the individual commit messages.
>
> Unless there are objections I would like to go ahead with this fixup
> fairly soon.
>

All the patches applied cleanly and the tests pass. A few minor comments:

In SetDataChecksumsOff there is a stale comment; only INPROGRESS_OFF can
reach the else branch:
/*
* Ending up here implies that the checksums state is "inprogress-on"
* or "inprogress-off" and we can transition directly to "off" from
* there.
*/
SpinLockRelease(&XLogCtl->info_lck);

Should we update ControlFile->data_checksum_version at end-of-recovery? If
not, the on-disk value is stale compared to the in-memory one until the
next checkpoint. The code a couple of lines below updates the control file
anyway to set DB_IN_PRODUCTION; maybe combine the update with that? It's no
big deal if we don't, since it will self-correct, but tools like
pg_controldata will report a stale value for some time.

Thanks,
Satya
