Re: Changing the state of data checksums in a running cluster

From: Daniel Gustafsson <daniel(at)yesql(dot)se>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Bernd Helmle <mailings(at)oopsware(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Changing the state of data checksums in a running cluster
Date: 2026-03-16 23:36:11
Message-ID: 032619A2-D466-4A12-A524-98359D96AEA6@yesql.se
Lists: pgsql-hackers

> On 15 Mar 2026, at 23:47, Tomas Vondra <tomas(at)vondra(dot)me> wrote:

>> * The change to XLOG_CHECKPOINT_REDO to move the wal_level into a proper record
>> structure should be pulled out as a 0001 patch as it's a cleanup that has
>> value on its own.
>
> Makes sense, but it's going to be harder because since d774072f0040 all
> 4 bits in XLR_INFO are used.

Fixed by adding a second XLOG rmgr.

> 1) Is this actually doing the expected thing?
>
> INJECTION_POINT("datachecksumsworker-initial-dblist", DatabaseList);
>
> We're passing a regular pointer to the database list, so can the
> injection point actually modify it? I suppose it happens to work because
> dc_dblist() removes the last item, so the pointer to the list does not
> change. But that's luck.

Fixed.
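
For readers following along, the by-value problem raised in 1) can be
illustrated with a small standalone sketch. This is not the code from the
patch and does not show how the fix was made; DbNode, db_push,
drop_first_by_value and drop_first_by_ref are hypothetical names, and a
plain linked list stands in for the real database list.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical singly linked list standing in for the database list. */
typedef struct DbNode
{
    unsigned int   oid;
    struct DbNode *next;
} DbNode;

/* Prepend a node and return the new head. */
DbNode *
db_push(DbNode *head, unsigned int oid)
{
    DbNode *n = malloc(sizeof(DbNode));

    if (n == NULL)
        abort();
    n->oid = oid;
    n->next = head;
    return n;
}

/*
 * Receives the head pointer by value.  Advancing it only changes the local
 * copy, so the caller still sees the original head afterwards.
 */
void
drop_first_by_value(DbNode *head)
{
    if (head != NULL)
        head = head->next;
}

/*
 * Receives the address of the head pointer, so the caller's pointer is
 * updated when the first node is removed.
 */
void
drop_first_by_ref(DbNode **head)
{
    if (*head != NULL)
    {
        DbNode *old = *head;

        *head = old->next;
        free(old);
    }
}
```

Removing the last item happens to work with either signature because the
head pointer stays the same, which matches the "that's luck" observation
above. Any modification that replaces the head needs the second form.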

> 2) ProcessAllDatabases may be misusing processed_databases

Good point, we need to track both the number of databases processed and the
cumulative total.
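
As a rough illustration of keeping the two counts separate (a sketch only,
with hypothetical names; it is not the structure used in the patch and it
assumes the work is done in repeated passes over the database list):

```c
#include <stdbool.h>

/* Hypothetical progress state; field names are illustrative only. */
typedef struct ChecksumProgress
{
    int processed_this_pass;    /* reset at the start of every pass */
    int processed_total;        /* cumulative, never reset */
} ChecksumProgress;

/* Begin a new pass over the database list. */
void
progress_start_pass(ChecksumProgress *p)
{
    p->processed_this_pass = 0;
}

/* Record one successfully processed database. */
void
progress_record(ChecksumProgress *p)
{
    p->processed_this_pass++;
    p->processed_total++;
}

/* True if the current pass processed at least one database. */
bool
progress_pass_made_progress(const ChecksumProgress *p)
{
    return p->processed_this_pass > 0;
}
```

With a single counter, "did this pass do anything" and "how many databases
have been done overall" cannot both be answered once a second pass starts.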

> 3) DATACHECKSUMSWORKER_MAX_DB_RETRIES / DATACHECKSUMSWORKER_FAILED
>
> What happens if a database reaches the maximum number of retries? We
> mark that entry as failed, but AFAIK we'll still try to process any
> remaining databases. Isn't that already doomed and we won't be able to
> enable checksums? So why not simply abort the loop right away?

It might be, but processing can also fail because the database was
concurrently dropped, in which case we don't consider it a failure since that
is the expected outcome.  This is tested for at the end of the loop, but maybe
it can be detected sooner so that we error out early on actual failures.
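
A minimal sketch of what detecting this sooner could look like, under
assumed names: DbEntry, try_process, process_all and MAX_DB_RETRIES are
hypothetical, and the "dropped" flag stands in for whatever check the worker
would really perform. It is not the loop from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_DB_RETRIES 3        /* hypothetical retry limit */

/* Simulated database entry; all fields are illustrative only. */
typedef struct DbEntry
{
    int  failures_before_success;   /* attempts that fail before one succeeds */
    bool dropped;                   /* database was concurrently dropped */
    int  attempts;                  /* attempts made so far */
} DbEntry;

/* Simulated processing attempt; returns true on success. */
static bool
try_process(DbEntry *db)
{
    db->attempts++;
    if (db->dropped)
        return false;
    return db->attempts > db->failures_before_success;
}

/*
 * Process every database.  A dropped database is skipped as an expected
 * outcome; a database that exhausts its retries aborts the loop at once.
 * Returns true only if no genuine failure occurred.
 */
bool
process_all(DbEntry *dbs, size_t ndbs)
{
    for (size_t i = 0; i < ndbs; i++)
    {
        DbEntry *db = &dbs[i];
        bool     done = false;

        while (db->attempts < MAX_DB_RETRIES)
        {
            if (try_process(db))
            {
                done = true;
                break;
            }
            if (db->dropped)
                break;          /* retrying a dropped database is pointless */
        }

        if (done || db->dropped)
            continue;           /* success, or expected concurrent drop */

        return false;           /* genuine failure: stop right away */
    }
    return true;
}
```

The point of the sketch is only the ordering: the dropped check happens
inside the per-database handling, so a real failure ends the loop without
touching the remaining databases.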

--
Daniel Gustafsson

Attachment Content-Type Size
v20260316-0001-Add-proper-WAL-record-for-XLOG_CHECKPOINT_.patch application/octet-stream 3.9 KB
v20260316-0002-Online-enabling-and-disabling-of-data-chec.patch application/octet-stream 232.6 KB