Re: Changing the state of data checksums in a running cluster

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Bernd Helmle <mailings(at)oopsware(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Changing the state of data checksums in a running cluster
Date: 2025-08-27 12:42:05
Message-ID: dfe57980-f594-46c5-af39-852ff30d34fa@vondra.me
Lists: pgsql-hackers

On 8/27/25 14:39, Tomas Vondra wrote:
> ...
>
> And this happened on Friday:
>
> commit c13070a27b63d9ce4850d88a63bf889a6fde26f0
> Author: Alexander Korotkov <akorotkov(at)postgresql(dot)org>
> Date: Fri Aug 22 18:44:39 2025 +0300
>
> Revert "Get rid of WALBufMappingLock"
>
> This reverts commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683.
> It appears that conditional variables are not suitable for use
> inside critical sections. If WaitLatch()/WaitEventSetWaitBlock()
> face postmaster death, they exit, releasing all locks instead of
> PANIC. In certain situations, this leads to data corruption.
>
> ...
>
> I think it's very likely the checksums were broken by this. After all,
> that linked thread has subject "VM corruption on standby" and I've only
> ever seen checksum failures on standby on the _vm fork.
>

Forgot to mention - I did try with c13070a27b reverted, and with that I
can reproduce the checksum failures again (using the fixed TAP test).

It's not definitive proof, but it's a hint that the bug fixed by
c13070a27b63 was causing the checksum failures.

regards

--
Tomas Vondra
