Re: Changing the state of data checksums in a running cluster

From: Daniel Gustafsson <daniel(at)yesql(dot)se>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Bernd Helmle <mailings(at)oopsware(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Changing the state of data checksums in a running cluster
Date: 2025-08-25 18:32:51
Message-ID: 17C1D2E0-C12D-40E2-B4B8-B9CCECA45A88@yesql.se
Lists: pgsql-hackers

> On 20 Aug 2025, at 16:37, Tomas Vondra <tomas(at)vondra(dot)me> wrote:

> This happens quite regularly, it's not hard to hit. But I've only seen
> it to happen on a FSM, and only right after immediate shutdown. I don't
> think that's quite expected.
>
> I believe the built-in TAP tests (with injection points) can't catch
> this, because there's no concurrent activity while flipping checksums
> on/off. It'd be good to do something like that, by running pgbench in
> the background, or something like that.

While searching for this bug I implemented a version of the stress tests as a
TAP test, see 006_concurrent_pgbench.pl in the attached patch version. It's
gated behind PG_TEST_EXTRA since it's clearly not something which can be
enabled by default (if this goes in, it needs to be redone to provide two
levels IMO, but during testing this is more convenient). I'm curious to see
which improvements you can think of to make it stress the code to the
breaking point.

> I think there's a minor issue in how pg_checksums validates state before
> checking the data.
>
> The current patch simply does:
>
>     if (ControlFile->data_checksum_version == 0 &&
>         mode == PG_MODE_CHECK)
>         pg_fatal("data checksums are not enabled in cluster");
>
> and that worked when the version was either 0 or 1. But now it can be
> also 2 or 3, for inprogress-on / inprogress-off, and if the cluster gets
> shut down at the right moment, that can end in the control file.

Good point, I've changed the test to check for checksums being enabled rather
than checking if they are disabled.
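
To illustrate why the direction of the test matters, here is a minimal
standalone sketch. It is not the code from the patch: the enum and function
names are made up for illustration, and only the values 0 to 3 are taken from
the description quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical names, for illustration only.  The values follow the
 * description quoted above: 0 = off, 1 = on, 2 = inprogress-on,
 * 3 = inprogress-off.
 */
typedef enum
{
	SKETCH_CHECKSUMS_OFF = 0,
	SKETCH_CHECKSUMS_ON = 1,
	SKETCH_CHECKSUMS_INPROGRESS_ON = 2,
	SKETCH_CHECKSUMS_INPROGRESS_OFF = 3
} SketchChecksumState;

/* Old shape of the test: reject only when the version is exactly 0. */
static bool
check_allowed_old(uint32_t data_checksum_version)
{
	return data_checksum_version != SKETCH_CHECKSUMS_OFF;
}

/* New shape of the test: accept only when checksums are fully enabled. */
static bool
check_allowed_new(uint32_t data_checksum_version)
{
	return data_checksum_version == SKETCH_CHECKSUMS_ON;
}
```

With the old shape, a control file left at 2 or 3 after a shutdown would pass
the test and be verified as if checksums were fully enabled; with the new
shape only 1 passes.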

--
Daniel Gustafsson

Attachment: v20250825-0001-Online-enabling-and-disabling-of-data-chec.patch (application/octet-stream, 177.1 KB)
