Re: Changing the state of data checksums in a running cluster

From: Daniel Gustafsson <daniel(at)yesql(dot)se>
To: Ayush Tiwari <ayushtiwari(dot)slg01(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Tomas Vondra <tomas(at)vondra(dot)me>, Andres Freund <andres(at)anarazel(dot)de>, Bernd Helmle <mailings(at)oopsware(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>
Subject: Re: Changing the state of data checksums in a running cluster
Date: 2026-04-28 22:06:14
Message-ID: FA4C386A-CDD2-4F6F-9C5D-9B88FC6F5D3D@yesql.se
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On 25 Apr 2026, at 13:59, Ayush Tiwari <ayushtiwari(dot)slg01(at)gmail(dot)com> wrote:

> I don't think 0006 fully addresses the duplicate-launcher exit race [1]. I was
> able to reproduce the issue with a deterministic test using the existing
> test_checksums delay hooks: delay
> the initial StartDataChecksumsWorkerLauncher()
> path so two quick pg_enable_data_checksums() calls can queue two launchers,
> delay the final transition to on, then fire follow-up pg_enable_data_checksums()
> calls while the real launcher is still active.

Thanks to the really great reproducer script I got offlist (which is attached
here) I was able to reproduce and figure out what was wrong. This sequence
exposed a thinko in the cleanup handler where a process exiting due to an
already running launcher would enter cleanup, and thus wipe state from the
running launcher. It needed this quite narrow window to hit, but it was good
to catch that. Fixed in the attached by giving processes a way to register
themselves as needing no cleanup when exiting.

> A few other smaller comments/nits:
>
> 1. In src/backend/access/transam/xlog.c, the comment in SetDataChecksumsOff()
> says the branch implies the state is inprogress-on or inprogress-off, but
> after the patch inprogress-on is handled by the preceding if condition. It
> looks like this should probably only mention inprogress-off.

Fixed.

> 2. There is a repeated typo in comments:
>
> src/backend/utils/init/postinit.c: "interrups" -> "interrupts"
> src/backend/postmaster/auxprocess.c: "interrups" -> "interrupts"

Fixed.

> 3. A few commit messages could use a quick polish pass:
>
> 0002: "onine" -> "online"
> 0004: "Additionelly" -> "Additionally"
> 0006: "being enabled of disabled" -> "being enabled or disabled"
> 0006: "change it's operation" -> "change its operation"
> 0006: "ss now avoided" -> "is now avoided"

Fixed.

> 4. One minor question on 0006: the runtime cost-parameter update path is only
> meaningful while checksums are being enabled. That looks fine from the current
> control flow, but would it be worth adding a short comment or assertion near
> that update path to make the assumption explicit?

We have this assertion which makes sure that there are no costs passed for
disabling, not sure if we need much more since there is now way for the user to
inject any cost parameters.

#ifdef USE_ASSERT_CHECKING
/* The cost delay settings have no effect when disabling */
if (op == DISABLE_DATACHECKSUMS)
Assert(cost_delay == 0 && cost_limit == 0);
#endif

--
Daniel Gustafsson

Attachment Content-Type Size
v2-0001-Prevent-pg_enable-disable_data_checksums-on-stand.patch application/octet-stream 1.5 KB
v2-0002-Test-improvements-for-online-checksums.patch application/octet-stream 7.3 KB
v2-0003-Handle-data_checksum-state-changes-during-launche.patch application/octet-stream 6.5 KB
v2-0004-Fix-invalid-checksum-state-transition-in-checkpoi.patch application/octet-stream 13.2 KB
v2-0005-Typo-and-spelling-fixups-for-online-checksums.patch application/octet-stream 3.6 KB
v2-0006-Improve-handling-of-concurrent-checksum-requests.patch application/octet-stream 9.8 KB
v2-0007-Improve-database-detection-logic-in-datachecksums.patch application/octet-stream 2.3 KB
v2-0008-Fix-data_checksum-GUC-show_hook.patch application/octet-stream 974 bytes
0006_bug_test.bash.txt text/plain 6.9 KB
unknown_filename text/plain 1 byte

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2026-04-28 23:12:04 Re: Fix the error hint message and test for reset_shared with unknown stats type
Previous Message Masahiko Sawada 2026-04-28 21:48:31 Re: Support logical replication of DDLs, take2