Re: Changing the state of data checksums in a running cluster

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Andres Freund <andres(at)anarazel(dot)de>, Bernd Helmle <mailings(at)oopsware(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Changing the state of data checksums in a running cluster
Date: 2026-04-03 16:00:52
Message-ID: 21bc25e8-4d88-4130-a39b-cb9d8243287f@iki.fi
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 03/04/2026 18:33, Daniel Gustafsson wrote:
> The attached rebase with a PG_CONTROL_VERSION bump is what I have staged for
> later tonight, submitting here to have the (hopefully) final patch archived as
> well as another CFBot run.

A few more small comments, I'm sorry about drip-feeding these:

> +/*
> + * launcher_exit
> + *
> + * Internal routine for cleaning up state when the launcher process exits. We
> + * need to clean up the abort flag to ensure that processing started again if
> + * it was previously aborted (note: started again, *not* restarted from where
> + * it left off).
> + */
> +static void
> +launcher_exit(int code, Datum arg)
> +{
> + abort_requested = false;
> +
> + if (launcher_running)
> + {
> + LWLockAcquire(DataChecksumsWorkerLock, LW_EXCLUSIVE);
> + launcher_running = false;
> + DataChecksumState->launcher_running = false;
> +
> + if (DataChecksumState->worker_running != InvalidPid)
> + {
> + ereport(LOG,
> + errmsg("data checksums launcher exiting while worker is still running, signalling worker"));
> + kill(DataChecksumState->worker_running, SIGTERM);
> + }
> + LWLockRelease(DataChecksumsWorkerLock);
> + }
> +
> + /*
> + * If the launcher is exiting before data checksums are enabled then set
> + * the state to off since processing cannot be resumed.
> + */
> + if (DataChecksumsInProgress())
> + SetDataChecksumsOff();
> +}

Is there still a race condition if the launcher is killed, it gets here,
sends SIGTERM to the worker process, but before the worker process has
exited, the user calls pg_enable_data_checksums() again and a new
launcher is started? What happens?

> + /*
> + * Is a worker process currently running? This is set by the worker
> + * launcher when it starts waiting for a worker process to finish.
> + */
> + int worker_running;

'worker_running' sounds like a boolean, but it's actually a PID.
Especially when 'launcher_running' really is a boolean. Maybe rename to
'worker_pid' or 'worker_running_pid' or 'running_worker_pid' or something.

> +bool
> +DataChecksumsInProgress(void)
> +{
> + return LocalDataChecksumState == PG_DATA_CHECKSUM_INPROGRESS_ON;
> +}

Perhaps this should be DataChecksumsInProgressOn()? I saw the caller in
launcher_exit() first, and had to look up the implementation to check if
it returns true for 'inprogress-off' state.

> diff --git a/src/include/storage/checksum.h b/src/include/storage/checksum.h
> index ff417d5ae3e..fe5d30b4349 100644
> --- a/src/include/storage/checksum.h
> +++ b/src/include/storage/checksum.h
> @@ -15,6 +15,20 @@
>
> #include "storage/block.h"
>
> +/*
> + * Checksum state 0 is used for when data checksums are disabled (OFF).
> + * PG_DATA_CHECKSUM_INPROGRESS_{ON|OFF} defines that data checksums are either
> + * currently being enabled or disabled, and PG_DATA_CHECKSUM_VERSION defines
> + * that data checksums are enabled.
> + */
> +typedef enum ChecksumStateType
> +{
> + PG_DATA_CHECKSUM_OFF = 0,
> + PG_DATA_CHECKSUM_VERSION,
> + PG_DATA_CHECKSUM_INPROGRESS_OFF,
> + PG_DATA_CHECKSUM_INPROGRESS_ON,
> +} ChecksumStateType;
> +
> /*
> * Compute the checksum for a Postgres page. The page must be aligned on a
> * 4-byte boundary.

It'd be good to mention that this value is stored in the control file,
so changing it needs a catversion bump. Also it's important that
PG_DATA_CHECKSUM_VERSION = 1, for backwards-compatibility in pg_upgrade.
I'd suggest assigning explicit values 1, 2, 3, 4 for each of the enum
constants, to emphasize that they values are fixed.

There's an #include "storage/bufpage.h" in
src/bin/pg_upgrade/controldata.c and
src/backend/access/rmgrdesc/xlogdesc.c. I think they should both include
"storage/checksum.h" directly instead. And "bufpage.h" probably doesn't
need to include "storage/checksum.h".

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Corey Huinker 2026-04-03 16:01:33 Re: Import Statistics in postgres_fdw before resorting to sampling.
Previous Message Srinath Reddy Sadipiralla 2026-04-03 15:58:16 Re: Adding REPACK [concurrently]