Quick Links

Re: Changing the state of data checksums in a running cluster

From:	Daniel Gustafsson <daniel(at)yesql(dot)se>
To:	Tomas Vondra <tomas(at)vondra(dot)me>
Cc:	Bernd Helmle <mailings(at)oopsware(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Changing the state of data checksums in a running cluster
Date:	2025-10-06 12:53:01
Message-ID:	67726BDA-2F21-4D93-B01F-FA68212ED6F0@yesql.se
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

> On 1 Sep 2025, at 14:11, Tomas Vondra <tomas(at)vondra(dot)me> wrote:

> I kept stress testing this over the weekend, and I think I found two
> issues causing the checksum failures, both for a single node and on a
> standby:

Thanks a lot for all your testing. I was able to reproduce ..

> I have an experimental WIP branch at:
>
> https://github.com/tvondra/postgres/tree/online-checksums-tap-tweaks

.. and with your changes I also see the reproducer go away. I concur that
you likely found the issue and the right fix for it. I have absorbed your
patches into my branch, the debug logging is left as 0002 but the other ones
are incorporated into 0001.

> It fixes the TAP issues reported earlier (and a couple more), and it
> does a bunch of additional tweaks:
>
> a) A lot of debug messages that helped me to figure this out. This is
> probably way too much, especially the controlfile updates can be very
> noisy on a standby.

I've toned down the logging a bit, but kept most of it in 0002.

> b) Adds a simpler TAP, testing just a single node (should be easier to
> understand than with failures on standby).

This turned out to be more useful than I initially thought, so I've kept this
in the attached version. There could be value in separating the single and
dual node tests into different PG_TEST_EXTRA values given how intensive the
latter is.

> Anyway, with (c) and (d) applied, the checksum failures go away. It may
> not be 100% right (e.g. we could do away with fewer checkpoints), but it
> seems to be the right direction.

I think so too, and while I have removed one of them due to being issued just
before (or after) another checkpoint I do believe this is the right fix for the
issue. There might well be more issues, but I wanted to get a new version out
on the thread to get more visibility on the the new tests.

> I haven't looked into this more, but how come the "off" direction does
> not need to check InitialDataChecksumTransition?

This boiled down into the barrier absorbing functions evolving out of sync with
one another over multiple versions of the patch. To address this absorbing the
barrier has been converted into a single function which is driven by an array
of ChecksumBarrierCondition structs, one for each target state. This struct
defines what the current state of the cluster must be for the barrier to be
successfully absorbed. This removed a lot of duplicate code and also unifies
the previously quite varied levels of assertions at the barrier.

> I think the TAP test turned out to be very useful, so far. While
> investigating on this, I thought about a couple more tweaks to make it
> detect additional issues (on top of the randomization).
>
> - Right now the shutdowns/restarts happen only in very limited places.
> The checksum flips from on/off or off/on, and then a restart happens.
> AFAICS it never happens in the "inprogress" phases, right?

That would be a good idea, as well as use a crashing injection test on top
controlled shutdowns.

> - The pgbench clients connect once, so there are almost no new
> connections while flipping checksums. Maybe some of the pgbenches should
> run with "-C", to open new connections. It was pretty lucky the TAP
> query hit the assert, this would make it more likely.

I've added this to both tests using pgbench, randomized with a cointoss call.

The attached has the above as well as a few other changes:

* A new injection test which calls abort() right before checkpointing has been
added in 005_injection.

* Most ereport calls have gotten proper errcodes in order to aid analysis,
particurly fleet-wide analysis should this be deployed in a larger setting.

* More code-level documentation of test code and several tweaked (and added)
code comments to aid readability.

--
Daniel Gustafsson

Attachment	Content-Type	Size
v20251006-0001-Online-enabling-and-disabling-of-data-chec.patch	application/octet-stream	206.1 KB
v20251006-0002-Log-checksum-version-during-checkpoints-et.patch	application/octet-stream	5.4 KB

In response to

Re: Changing the state of data checksums in a running cluster at 2025-09-01 12:11:06 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Álvaro Herrera	2025-10-06 13:02:29	Re: Differential Code Coverage report for Postgres
Previous Message	jian he	2025-10-06 12:51:07	src/include/utils/float.h comment one link stable