
Re: Changing the state of data checksums in a running cluster

From:	Daniel Gustafsson <daniel(at)yesql(dot)se>
To:	Tomas Vondra <tomas(at)vondra(dot)me>
Cc:	Andres Freund <andres(at)anarazel(dot)de>, Bernd Helmle <mailings(at)oopsware(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Changing the state of data checksums in a running cluster
Date:	2026-03-16 23:36:11
Message-ID:	032619A2-D466-4A12-A524-98359D96AEA6@yesql.se
Lists:	pgsql-hackers

> On 15 Mar 2026, at 23:47, Tomas Vondra <tomas(at)vondra(dot)me> wrote:

>> * The change to XLOG_CHECKPOINT_REDO to move the wal_level into a proper record
>> structure should be pulled out as a 0001 patch, as it's a cleanup that has
>> value on its own.
>
> Makes sense, but it's going to be harder because since d774072f0040 all
> 4 bits in XLR_INFO are used.

Fixed by adding a second XLOG rmgr.
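For context on the constraint above: in PostgreSQL's WAL format the low four bits of a record's `xl_info` byte (`XLR_INFO_MASK`, 0x0F) are reserved for the WAL machinery, and a resource manager may only use the high four bits (`XLR_RMGR_INFO_MASK`, 0xF0) for its record types. That caps each rmgr at 16 opcodes, so once all of them are taken a new record type needs another rmgr. The C sketch below checks that arithmetic and shows a hypothetical struct for a redo record carrying `wal_level`; the `sketch_`-prefixed names are illustrative assumptions and are not taken from the attached 0001 patch.

```c
#include <assert.h>
#include <string.h>

/* Mirrors XLR_INFO_MASK / XLR_RMGR_INFO_MASK from xlogrecord.h:
 * low nibble reserved for the WAL machinery, high nibble for the rmgr. */
#define SKETCH_XLR_INFO_MASK      0x0F
#define SKETCH_XLR_RMGR_INFO_MASK 0xF0

/* Count the info values an rmgr may use as record types: 2^4 = 16. */
static int
rmgr_opcode_capacity(void)
{
    int count = 0;

    /* The two masks must partition the info byte. */
    assert((SKETCH_XLR_INFO_MASK & SKETCH_XLR_RMGR_INFO_MASK) == 0);
    assert((SKETCH_XLR_INFO_MASK | SKETCH_XLR_RMGR_INFO_MASK) == 0xFF);

    for (int info = 0; info <= 0xFF; info++)
        if ((info & SKETCH_XLR_INFO_MASK) == 0)
            count++;
    return count;
}

/* Hypothetical record payload: wal_level in a named struct instead of a
 * bare int, giving the record a documented layout that can grow. */
typedef struct sketch_xl_checkpoint_redo
{
    int wal_level;
} sketch_xl_checkpoint_redo;

/* Copy the payload into a WAL data buffer; returns bytes written. */
static size_t
encode_redo(const sketch_xl_checkpoint_redo *rec, unsigned char *buf)
{
    memcpy(buf, rec, sizeof(*rec));
    return sizeof(*rec);
}

/* Read the payload back during redo; memcpy avoids alignment issues. */
static sketch_xl_checkpoint_redo
decode_redo(const unsigned char *buf)
{
    sketch_xl_checkpoint_redo rec;

    memcpy(&rec, buf, sizeof(rec));
    return rec;
}
```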

> 1) Is this actually doing the expected thing?
>
> INJECTION_POINT("datachecksumsworker-initial-dblist", DatabaseList);
>
> We're passing a regular pointer to the database list, so can the
> injection point actually modify it? I suppose it happens to work because
> dc_dblist() removes the last item, so the pointer to the list does not
> change. But that's luck.

Fixed.
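The pitfall Tomas describes is the usual C one: a callback that receives a list pointer by value can edit the nodes it points to, but cannot change the caller's pointer itself. The sketch below uses a hypothetical singly linked list (not PostgreSQL's `List`), with the callback argument modelled as a `void *` like the one in the quoted `INJECTION_POINT` call. It shows that removing the last element through a plain pointer only works while at least two elements remain, whereas passing the address of the pointer (for example `&DatabaseList`) handles every case. Whether the fix in the attached patch takes exactly this form is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical singly linked list standing in for the database list. */
typedef struct DbNode
{
    int oid;
    struct DbNode *next;
} DbNode;

static DbNode *
push_front(DbNode *head, int oid)
{
    DbNode *n = malloc(sizeof(*n));

    assert(n != NULL);
    n->oid = oid;
    n->next = head;
    return n;
}

static int
list_length(const DbNode *head)
{
    int len = 0;

    for (; head != NULL; head = head->next)
        len++;
    return len;
}

/* By value: the last node can only be unlinked by editing the previous
 * node's next pointer. A one-element list cannot be emptied, because the
 * caller's head pointer is out of reach. Returns 1 if a node was removed. */
static int
drop_last_by_value(void *arg)
{
    DbNode *head = arg;

    if (head == NULL || head->next == NULL)
        return 0;               /* would need to set the caller's head to NULL */

    DbNode *prev = head;
    while (prev->next->next != NULL)
        prev = prev->next;
    free(prev->next);
    prev->next = NULL;
    return 1;
}

/* By reference: the callback gets the address of the caller's list pointer,
 * so it can remove any node, including the only one. */
static void
drop_last_by_ref(void *arg)
{
    DbNode **link = arg;

    if (*link == NULL)
        return;
    while ((*link)->next != NULL)
        link = &(*link)->next;
    free(*link);
    *link = NULL;
}
```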

> 2) ProcessAllDatabases may be misusing processed_databases

Good point, we need to track both the number of processed databases and the
cumulative total.
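One plausible reading of the two counters, sketched under the assumption that the worker re-reads the database list after each pass to pick up databases created in the meantime: a per-pass count decides when to stop (a pass that finds nothing new), while a separate cumulative count feeds progress reporting. `Listing`, `Progress` and `process_all` are hypothetical names, not identifiers from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_DBS 16

/* Hypothetical snapshot of the databases visible when the list is re-read. */
typedef struct Listing
{
    int n;
    int oids[MAX_DBS];
} Listing;

typedef struct Progress
{
    int processed_this_pass;    /* drives loop termination */
    int processed_total;        /* cumulative, for progress reporting */
} Progress;

static bool
already_done(const int *done, int ndone, int oid)
{
    for (int i = 0; i < ndone; i++)
        if (done[i] == oid)
            return true;
    return false;
}

/* Process every database exactly once, re-reading the list after each pass;
 * returns the number of passes made. */
static int
process_all(const Listing *listings, int nlistings, Progress *prog)
{
    int done[MAX_DBS];
    int ndone = 0;
    int passes = 0;

    assert(nlistings > 0);
    prog->processed_total = 0;
    for (;;)
    {
        /* Re-read the list; reuse the last snapshot once they run out. */
        const Listing *cur = &listings[passes < nlistings ? passes : nlistings - 1];

        passes++;
        prog->processed_this_pass = 0;
        for (int i = 0; i < cur->n; i++)
        {
            int oid = cur->oids[i];

            if (already_done(done, ndone, oid))
                continue;
            assert(ndone < MAX_DBS);
            done[ndone++] = oid;        /* "process" the database */
            prog->processed_this_pass++;
            prog->processed_total++;
        }

        /* Stop once a full pass finds nothing new; testing the cumulative
         * counter here would never see zero after the first pass. */
        if (prog->processed_this_pass == 0)
            break;
    }
    return passes;
}
```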

> 3) DATACHECKSUMSWORKER_MAX_DB_RETRIES / DATACHECKSUMSWORKER_FAILED
>
> What happens if a database reaches the maximum number of retries? We
> mark that entry as failed, but AFAIK we'll still try to process any
> remaining databases. Isn't that already doomed and we won't be able to
> enable checksums? So why not simply abort the loop right away?

It might be, but a database can also fail because it was concurrently dropped,
and in that case we don't consider it a failure since it is the expected
outcome. This is checked at the end of the loop, but perhaps it could be
detected sooner so that we error out early on actual failures.

--
Daniel Gustafsson

Attachment	Content-Type	Size
v20260316-0001-Add-proper-WAL-record-for-XLOG_CHECKPOINT_.patch	application/octet-stream	3.9 KB
v20260316-0002-Online-enabling-and-disabling-of-data-chec.patch	application/octet-stream	232.6 KB

In response to

Re: Changing the state of data checksums in a running cluster at 2026-03-15 22:47:36 from Tomas Vondra

Responses

Re: Changing the state of data checksums in a running cluster at 2026-03-17 11:45:20 from Heikki Linnakangas
