| From: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
|---|---|
| To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
| Cc: | Tomas Vondra <tomas(at)vondra(dot)me>, Andres Freund <andres(at)anarazel(dot)de>, Bernd Helmle <mailings(at)oopsware(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Changing the state of data checksums in a running cluster |
| Date: | 2026-04-03 17:14:09 |
| Message-ID: | 4E492B44-C40F-4D03-B306-F8C1EFA535EA@yesql.se |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
> On 3 Apr 2026, at 18:00, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> On 03/04/2026 18:33, Daniel Gustafsson wrote:
>> The attached rebase with a PG_CONTROL_VERSION bump is what I have staged for
>> later tonight, submitting here to have the (hopefully) final patch archived as
>> well as another CFBot run.
>
> A few more small comments, I'm sorry about drip-feeding these:
Not at all, thanks for looking! I'm sure there will be a few more things once
there is BF coverage of the patch as well.
>> +/*
>> + * launcher_exit
>> + *
>> + * Internal routine for cleaning up state when the launcher process exits. We
>> + * need to clean up the abort flag to ensure that processing started again if
>> + * it was previously aborted (note: started again, *not* restarted from where
>> + * it left off).
>> + */
>> +static void
>> +launcher_exit(int code, Datum arg)
>> +{
>> + abort_requested = false;
>> +
>> + if (launcher_running)
>> + {
>> + LWLockAcquire(DataChecksumsWorkerLock, LW_EXCLUSIVE);
>> + launcher_running = false;
>> + DataChecksumState->launcher_running = false;
>> +
>> + if (DataChecksumState->worker_running != InvalidPid)
>> + {
>> + ereport(LOG,
>> + errmsg("data checksums launcher exiting while worker is still running, signalling worker"));
>> + kill(DataChecksumState->worker_running, SIGTERM);
>> + }
>> + LWLockRelease(DataChecksumsWorkerLock);
>> + }
>> +
>> + /*
>> + * If the launcher is exiting before data checksums are enabled then set
>> + * the state to off since processing cannot be resumed.
>> + */
>> + if (DataChecksumsInProgress())
>> + SetDataChecksumsOff();
>> +}
>
> Is there still a race condition if the launcher is killed, it gets here, sends SIGTERM to the worker process, but before the worker process has exited, the user calls pg_enable_data_checksums() again and a new launcher is started? What happens?
If the new launcher sets the state to inprogress-on before the old one checks
for that in launcher_exit, it would disable checksums, and the new launcher
would fail in the next state transitiom as the state is incorrect. The
solution would be to hold the DataChecksumsWorkerLock exclusively during
launcher_exit to ensure the old one can exit before the new one starts. Done
in the attached.
>> + /*
>> + * Is a worker process currently running? This is set by the worker
>> + * launcher when it starts waiting for a worker process to finish.
>> + */
>> + int worker_running;
>
> 'worker_running' sounds like a boolean, but it's actually a PID. Especially when 'launcher_running' really is a boolean. Maybe rename to 'worker_pid' or 'worker_running_pid' or 'running_worker_pid' or something.
Renamed to worker_pid.
>> +bool
>> +DataChecksumsInProgress(void)
>> +{
>> + return LocalDataChecksumState == PG_DATA_CHECKSUM_INPROGRESS_ON;
>> +}
>
> Perhaps this should be DataChecksumsInProgressOn()? I saw the caller in launcher_exit() first, and had to look up the implementation to check if it returns true for 'inprogress-off' state.
Fixed.
>> diff --git a/src/include/storage/checksum.h b/src/include/storage/checksum.h
>> index ff417d5ae3e..fe5d30b4349 100644
>> --- a/src/include/storage/checksum.h
>> +++ b/src/include/storage/checksum.h
>> @@ -15,6 +15,20 @@
>> #include "storage/block.h"
>> +/*
>> + * Checksum state 0 is used for when data checksums are disabled (OFF).
>> + * PG_DATA_CHECKSUM_INPROGRESS_{ON|OFF} defines that data checksums are either
>> + * currently being enabled or disabled, and PG_DATA_CHECKSUM_VERSION defines
>> + * that data checksums are enabled.
>> + */
>> +typedef enum ChecksumStateType
>> +{
>> + PG_DATA_CHECKSUM_OFF = 0,
>> + PG_DATA_CHECKSUM_VERSION,
>> + PG_DATA_CHECKSUM_INPROGRESS_OFF,
>> + PG_DATA_CHECKSUM_INPROGRESS_ON,
>> +} ChecksumStateType;
>> +
>> /*
>> * Compute the checksum for a Postgres page. The page must be aligned on a
>> * 4-byte boundary.
>
> It'd be good to mention that this value is stored in the control file, so changing it needs a catversion bump. Also it's important that PG_DATA_CHECKSUM_VERSION = 1, for backwards-compatibility in pg_upgrade. I'd suggest assigning explicit values 1, 2, 3, 4 for each of the enum constants, to emphasize that they values are fixed.
Fixed.
> There's an #include "storage/bufpage.h" in src/bin/pg_upgrade/controldata.c and src/backend/access/rmgrdesc/xlogdesc.c. I think they should both include "storage/checksum.h" directly instead. And "bufpage.h" probably doesn't need to include "storage/checksum.h".
Fixed, and validated with headerscheck. Doing this required adding checksum.h
to bootstrap/bootstrap.c which seems very correct as it needs to know about the
state enum.
--
Daniel Gustafsson
| Attachment | Content-Type | Size |
|---|---|---|
| v20260403_2-0001-Online-enabling-and-disabling-of-data-ch.patch | application/octet-stream | 226.0 KB |
| unknown_filename | text/plain | 1 byte |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | David G. Johnston | 2026-04-03 17:18:41 | Docs: Create table description for constraints markup fix and label tweaks |
| Previous Message | Robert Haas | 2026-04-03 17:13:47 | Re: pg_plan_advice |