From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Bernd Helmle <mailings(at)oopsware(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Changing the state of data checksums in a running cluster
Date: 2025-09-01 12:11:06
Message-ID: 3e67160c-3676-4419-b635-1fdb80dc128e@vondra.me
Lists: pgsql-hackers
On 8/29/25 16:38, Tomas Vondra wrote:
> On 8/29/25 16:26, Tomas Vondra wrote:
>> ...
>>
>> I've seen these failures after changing checksums in both directions,
>> both after enabling and disabling. But I've only ever seen this after
>> immediate shutdown, never after fast shutdown. (It's interesting that
>> pg_checksums failed only after fast shutdowns ...).
>>
>
> Of course, right after I sent the message, it failed after a fast shutdown,
> contradicting my observation ...
>
>> Could it be that the redo happens to start from an older position, but
>> using the new checksum version?
>>
>
> ... but it also provided more data supporting this hypothesis. I added
> logging of pg_current_wal_lsn() before / after changing checksums on the
> primary, and I see this:
>
> 1) LSN before: 14/2B0F26A8
> 2) enable checksums
> 3) LSN after: 14/EE335D60
> 4) standby waits for 14/F4E786E8 (higher, likely thanks to pgbench)
> 5) standby restarts with -m fast
> 6) redo starts at 14/230043B0, which is *before* enabling checksums
>
> I guess this is the root cause. A bit more detailed log attached.
>
I kept stress testing this over the weekend, and I think I found two
issues causing the checksum failures, both on a single node and on a
standby:
1) no checkpoint in the "disable path"
In the "enable" path, a checkpoint is enforced before flipping the state
from "inprogress-on" to "on". It's hidden in ProcessAllDatabases, but
it's there. The "off" path does not do that, probably on the assumption
that we'll always see the writes in WAL order, so that we'll see the
XLOG_CHECKSUMS record setting checksums=off before seeing any writes
without checksums.
And in the happy path this works fine - the standby is happy, etc. But
what about after a crash / immediate shutdown? Consider a sequence like
this:
a) we have checksums=on
b) write to page P, updating the checksum
c) start disabling checksums
d) progress to inprogress-off
e) progress to off
f) write to page P, without checksum update
g) the modified page P gets evicted (small shared buffers, ...)
h) crash / immediate shutdown
Recovery starts from an LSN before (a), so we believe checksums=on. We
try to redo the write to P, which starts by reading the page from disk,
to check the page LSN. We still think checksums=on, and to read the LSN
we need to verify the checksum. But the page was modified without the
checksum, and evicted. Kabooom!
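Just to make the failing step concrete, this is roughly the shape of the
check that trips (paraphrased from the in-tree buffer read path around
PageIsVerifiedExtended(), not from the patch; the surrounding code and
error message are simplified):

    /*
     * Redo wants the page LSN, so the block is read from disk first. The
     * checksum gets verified before anyone looks at the LSN, based on the
     * checksum state we believe in -- which after the crash is still "on".
     */
    if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
                                PIV_LOG_WARNING | PIV_REPORT_STAT))
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("invalid page in block %u", blockNum)));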
This is not that hard to trigger by hand. Add a long sleep at the end of
SetDataChecksumsOff, start pgbench with a scale larger than shared
buffers, and call pg_disable_data_checksums(). Once it gets stuck on the
sleep, give it more time to dirty and evict some pages, then kill -9. On
recovery you should get the same checksum failures.
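By "a long sleep" I mean nothing fancier than something like this at the
very end of SetDataChecksumsOff (a pure debugging hack, the 60 seconds
is arbitrary):

    /* debugging hack: widen the window before the kill -9 */
    pg_usleep(60 * USECS_PER_SEC);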
FWIW I've only ever seen failures on fsm/vm forks, which matches what I
see in the TAP tests. But isn't it a bit strange?
I think the "disable" path needs a checkpoint between inprogress-off and
off states, same as the "enable" path.
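Roughly something like this at the end of the "disable" sequence,
mirroring what ProcessAllDatabases does for "enable" (the exact
placement in the patch is my guess):

    /*
     * Force an immediate checkpoint after reaching "inprogress-off" and
     * before flipping to "off", so that recovery can never start from a
     * point where it still believes checksums=on while replaying writes
     * done without checksums.
     */
    RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);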
2) no restart point on the standby
The standby has a similar issue, I think. Even if the primary creates
all the necessary checkpoints, the standby may not create restart points
for them. If you look into xlog_redo, it only "remembers" the checkpoint
position, it does not trigger a restart point. That only happens in
XLogPageRead, based on the distance from the previous one.
So a failure very similar to the one on the primary is possible, even
with the extra checkpoint fixing (1). The primary flips checksums in
either direction, generating checkpoints, but the standby does not
create the corresponding restart points. It still applies the WAL, and
some of the pages without checksums get evicted.
And then the standby crashes (or is shut down immediately), starts redo
from a position far back, and runs into the same checksum failure when
trying to check the page LSN.
I think the standby needs some logic to force restart point creation
when the checksum flag changed.
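Something along these lines is what I have in mind (and roughly what the
branch does, modulo the details; checksum_state_changed is just a name I
made up for the sketch):

    /* set when redo sees an XLOG_CHECKSUMS record */
    static bool checksum_state_changed = false;

    ...
    checksum_state_changed = true;
    ...

    /*
     * When the next checkpoint record is replayed, don't just remember
     * it; ask the checkpointer for an immediate restartpoint, so that
     * pages written under the old checksum state can't linger unflushed
     * across an immediate shutdown.
     */
    if (checksum_state_changed)
    {
        RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE);
        checksum_state_changed = false;
    }

(I left out CHECKPOINT_WAIT here on purpose; blocking redo on the
checkpointer didn't seem necessary for the tests, though maybe it is for
correctness.)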
I have an experimental WIP branch at:
https://github.com/tvondra/postgres/tree/online-checksums-tap-tweaks
It fixes the TAP issues reported earlier (and a couple more), and it
does a bunch of additional tweaks:
a) A lot of debug messages that helped me figure this out. This is
probably way too much; the controlfile updates in particular can be very
noisy on a standby.
b) Adds a simpler TAP test, covering just a single node (should be
easier to understand than the failures on a standby).
c) Adds explicit checkpoints, to fix (1). It probably adds too many
checkpoints, though? AFAICS a checkpoint after the "inprogress" phase
should be enough, and the checkpoint after the "on/off" phase can go away.
d) Forces creating a restart point on the first checkpoint after an
XLOG_CHECKSUMS record. It's done in a somewhat silly way, using a static
flag. Maybe there's a more elegant approach, say by comparing the
checksum value in ControlFile to the received checkpoint?
e) Randomizes a couple more GUC values. This needs more thought; it was
done blindly, before I better understood how the failures happen (it
requires evicted buffers, without hitting max_wal_size, ...). There are
more params worth randomizing (e.g. the "fast" flag).
Anyway, with (c) and (d) applied, the checksum failures go away. It may
not be 100% right (e.g. we could probably get away with fewer
checkpoints), but it seems to be the right direction.
I don't have time to clean up the branch more, I've already spent too
much time looking at LSNs advancing in weird ways :-( Hopefully it's
good enough to show what needs to be fixed, etc. If there's a new
version, I'm happy to rerun the tests on my machines, ofc.
However, there still are more bugs. Attached is a log from a crash after
hitting the assert in AbsorbChecksumsOffBarrier:
    Assert((LocalDataChecksumVersion != PG_DATA_CHECKSUM_VERSION) &&
           (LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION ||
            LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION));
This happened while flipping checksums to 'off', but the backend already
thinks checksums are 'off':
    LocalDataChecksumVersion == 0
I think this implies some bug in setting up LocalDataChecksumVersion
after a connection is established, because this is for a query checking
the checksum state, executed by the TAP test (in a new connection,
right?).
I haven't looked into this more, but how come the "off" direction does
not need to check InitialDataChecksumTransition?
I think the TAP test turned out to be very useful so far. While
investigating this, I thought about a couple more tweaks to make it
detect additional issues (on top of the randomization).
- Right now the shutdowns/restarts happen only in very limited places.
The checksums flip from on to off or off to on, and then a restart
happens. AFAICS it never happens in the "inprogress" phases, right?
- The pgbench clients connect once, so there are almost no new
connections while flipping checksums. Maybe some of the pgbench runs
should use "-C", to open new connections. It was pretty lucky that the
TAP query hit the assert; this would make that more likely.
regards
--
Tomas Vondra
Attachment: assert.log (text/x-log, 6.7 KB)