| From: | Tomas Vondra <tomas(at)vondra(dot)me> |
|---|---|
| To: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
| Cc: | Bernd Helmle <mailings(at)oopsware(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Changing the state of data checksums in a running cluster |
| Date: | 2025-11-10 01:26:07 |
| Message-ID: | f8dbad9d-95c0-42ad-88f2-95ae8e6aa251@vondra.me |
| Lists: | pgsql-hackers |
Hi,
We had some off-list discussions about this patch (me, Daniel and
Andres), and Andres mentioned he suspects the patch may be breaking PITR
in some way. We didn't have any example of that, but PITR seems like a
pretty fundamental feature, so I took it seriously and decided to do
some stress testing. And yeah, there are issues :-(
I did similar stress testing in the past, which eventually evolved
into the two TAP tests in the current patch. Those TAP tests run either
a single node or a primary-standby cluster, flip checksums on/off while
restarting the instance(s), and verify the cluster behaves OK, with no
checksum failures, unexpected states, etc.
I chose to do something similar to test PITR - run a single node with
pgbench (or some other write activity), flip checksums on/off in a loop,
and take basebackups while that's happening. Then verify the basebackups
are valid and can be used to start a new instance. I planned to extend
this to more elaborate tests with proper PITR using a WAL archive, etc.
I didn't get that far - this simple test already hit a couple of issues.
I'm attaching the sets of test scripts, and the scripts used to validate
the backups. The scripts are fairly simple, but need some changes to
run - adjusting some paths, etc.
1) basebackup-short.tgz - basic basebackup test
2) basebackup-long.tgz - long-running basebackup test (throttled)
3) validate.tgz - validate backups from (1) and (2)
The test scripts expect a "scale" parameter for pgbench. I used 500, but
that's mostly arbitrary - maybe a different value would hit the issues
more often. Not sure.
While validating the backups (which is essentially about performing the
usual basebackup redo), I ran into two issues.
1) assertion failure in the checkpointer
TRAP: failed Assert("TransactionIdIsValid(initial)"), File:
"procarray.c", Line: 1707, PID: 2649754
postgres: checkpointer (ExceptionalCondition+0x56) [0x55d64ec38f96]
postgres: checkpointer (+0x5341d8) [0x55d64eac41d8]
postgres: checkpointer (GetOldestTransactionIdConsideredRunning+0xc)
[0x55d64eac598c]
postgres: checkpointer (CreateRestartPoint+0x725) [0x55d64e7dd2f5]
postgres: checkpointer (CheckpointerMain+0x3ec) [0x55d64ea38a8c]
postgres: checkpointer (postmaster_child_launch+0x102) [0x55d64ea3b592]
postgres: checkpointer (+0x4ad74a) [0x55d64ea3d74a]
postgres: checkpointer (PostmasterMain+0xce7) [0x55d64ea40b97]
postgres: checkpointer (main+0x1ca) [0x55d64e713b1a]
/lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7f55f1a33ca8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f55f1a33d65]
postgres: checkpointer (_start+0x21) [0x55d64e713e81]
I was really confused by this initially, because why would this break
tracking of running transactions in a snapshot, etc.? But that's really
a red herring - the real issue is calling CreateRestartPoint().
This happens because we need to ensure that a standby does not get
confused about the checksum state, so redo of XLOG_CHECKSUMS forces a
restart point whenever the checksum state changes - which has some
negative consequences, as discussed in my previous message.
But AFAICS in this case the root cause is calling this during a regular
redo, not just on a standby. Presumably there's a way to distinguish
these two cases, and skip the restart point creation on a simple redo.
Alternatively, maybe the standby should do something different instead
of creating a restart point. (More about that later.)
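Roughly something like this in the XLOG_CHECKSUMS redo path (just a
sketch to illustrate the idea - I haven't checked how the patch actually
triggers the restartpoint, I'm assuming it goes through
RequestCheckpoint(), and StandbyMode may not be the right condition to
test):

    /*
     * Sketch only: force the restartpoint only when running as a
     * standby, and skip it during plain crash/backup redo. The guard
     * might need to be InArchiveRecovery or something else entirely.
     */
    if (StandbyMode)
        RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT);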
2) unexpected state during redo of XLOG_CHECKSUMS record
An example of the failure:
LOG: redo starts at 36/5B028668
FATAL: incorrect data checksum state 3 for target state 1
CONTEXT: WAL redo at 37/E136E408 for XLOG/CHECKSUMS: on
ERROR: incorrect data checksum state 3 for target state 1
ERROR: incorrect data checksum state 3 for target state 1
ERROR: incorrect data checksum state 3 for target state 1
ERROR: incorrect data checksum state 3 for target state 1
ERROR: incorrect data checksum state 3 for target state 1
LOG: startup process (PID 2649028) exited with exit code 1
I was really confused about this at first, because I hadn't seen it
during earlier testing (which did redo after a crash), so why should it
happen here?
But the reason is very simple - the basebackup can run for a while,
spanning multiple checkpoints, and also multiple changes of the checksum
state (enable/disable). Furthermore, the basebackup copies the pg_control
file last, and that's where we get the checksum state from.
For the earlier crash/restart test that's not the case, because each
checksum change creates a checkpoint, and the redo starts from that
point. There can't be multiple XLOG_CHECKSUMS records to apply.
So after a crash we start from the last checkpoint (i.e. from the
initial checksum state), and then there can be only a single
XLOG_CHECKSUMS record to reapply.
But with a basebackup we get the checksum state from the pg_control
copied at the end (so likely the final state, not the initial one), and
there may be multiple XLOG_CHECKSUMS in between (and multiple
checkpoints). So what happens is that we start from the final state, and
then try to apply the earlier XLOG_CHECKSUMS states. Hence the failure.
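To illustrate, a made-up (but representative) sequence:

    basebackup starts                   cluster has checksums off
    ... XLOG_CHECKSUMS: on
    ... checkpoint
    ... XLOG_CHECKSUMS: off
    ... checkpoint
    basebackup copies pg_control        records the final "off" state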
This seems like a pretty fundamental issue - we can't guarantee a backup
won't run too long, or anything like that. A full backup may easily run
for multiple hours or even days.
I suspect it should also be possible to hit an issue with "future"
blocks, similar to what we observed on the standby in earlier testing
(which is why we ended up creating restartpoints on the standby).
Imagine a basebackup starts, we disable checksums, and one of the blocks
gets written to disk without a checksum before the basebackup copies
that file. Then during redo of the backup, we start with checksums=on
(assume we don't have the issue with incorrect state). If we attempt to
read the page before replaying the XLOG_CHECKSUMS record, that'll fail,
because the on-disk page does not have a valid checksum, and the read
has to pass checksum verification before we can even look at the page LSN.
I'm only speculating this can happen, I haven't actually seen /
reproduced it. But maybe it's fixed thanks to FPIs, or something like
that? It didn't help on the standby, though. And if it can happen,
allowing a random mix of pages with/without checksums in a basebackup
seems problematic.
What could we do about the root cause? Daniel and I discussed this and
were stuck for quite a while. But then it occurred to us that maybe we
can simply "pause" the checksum state change while there's a backup in
progress. We already enable/disable FPW based on this, so why couldn't
we check XLogCtl->Insert.runningBackups, and only advance to the next
checksum state if (runningBackups == 0)?
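For example, something like this helper in xlog.c (the function does not
exist, the name is made up, it's just to show the check - runningBackups
is the counter do_pg_backup_start() increments under the WAL insertion
locks):

    static bool
    DataChecksumsChangeAllowed(void)
    {
        bool    allowed;

        WALInsertLockAcquireExclusive();
        allowed = (XLogCtl->Insert.runningBackups == 0);
        WALInsertLockRelease();

        return allowed;
    }

The state machine would call this before advancing to the next checksum
state, and simply retry later if a backup is running. (The real thing
would presumably need to do the switch while still holding the locks, or
recheck, so that a backup can't start right in between.)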
That would mean a single backup does not need to worry about seeing a
mix of blocks written with different checksum states, and it also means
the final pg_control file has the correct checksum state, because it is
not allowed to change during the basebackup.
Of course, this would mean checksum changes may take longer. A corner
case is that a database with a basebackup running 100% of the time won't
be able to change checksums on-line. But to me that seems acceptable, if
communicated / documented clearly.
It also occurred to me that something like this might help with the
standby case too. On the standby, the problem happens when it skips
checkpoints while creating a restart point, in which case redo may go
too far back, with an incompatible checksum state. Maybe if a standby
reported the LSN of its last restartpoint, the primary could use that to
decide whether it's safe to update the checksum state. Of course, this
is tricky, because standbys may be disconnected, there's cascading
replication, etc.
regards
--
Tomas Vondra
| Attachment | Content-Type | Size |
|---|---|---|
| validate.tgz | application/x-compressed-tar | 636 bytes |
| basebackup-long.tgz | application/x-compressed-tar | 1.1 KB |
| basebackup-short.tgz | application/x-compressed-tar | 1.1 KB |