Re: Changing the state of data checksums in a running cluster

From:	Tomas Vondra <tomas(at)vondra(dot)me>
To:	Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc:	Bernd Helmle <mailings(at)oopsware(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Changing the state of data checksums in a running cluster
Date:	2025-08-27 12:42:05
Message-ID:	dfe57980-f594-46c5-af39-852ff30d34fa@vondra.me
Lists:	pgsql-hackers

On 8/27/25 14:39, Tomas Vondra wrote:
> ...
>
> And this happened on Friday:
>
> commit c13070a27b63d9ce4850d88a63bf889a6fde26f0
> Author: Alexander Korotkov <akorotkov(at)postgresql(dot)org>
> Date: Fri Aug 22 18:44:39 2025 +0300
>
> Revert "Get rid of WALBufMappingLock"
>
> This reverts commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683.
> It appears that conditional variables are not suitable for use
> inside critical sections. If WaitLatch()/WaitEventSetWaitBlock()
> face postmaster death, they exit, releasing all locks instead of
> PANIC. In certain situations, this leads to data corruption.
>
> ...
>
> I think it's very likely the checksums were broken by this. After all,
> that linked thread has subject "VM corruption on standby" and I've only
> ever seen checksum failures on standby on the _vm fork.
>

Forgot to mention - I did try with c13070a27b reverted, and with that I
can reproduce the checksum failures again (using the fixed TAP test).

It's not definitive proof, but it's a hint that bc22dc0e0ddc (the commit
that c13070a27b63 reverted) was causing the checksum failures.

regards

--
Tomas Vondra

In response to

Re: Changing the state of data checksums in a running cluster at 2025-08-27 12:39:38 from Tomas Vondra
