| From: | Tomas Vondra <tomas(at)vondra(dot)me> |
|---|---|
| To: | Daniel Gustafsson <daniel(at)yesql(dot)se>, Andres Freund <andres(at)anarazel(dot)de> |
| Cc: | Bernd Helmle <mailings(at)oopsware(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Changing the state of data checksums in a running cluster |
| Date: | 2026-02-05 23:15:27 |
| Message-ID: | 775fddbc-4dec-413d-89bb-1a8186858ee6@vondra.me |
| Lists: | pgsql-hackers |
On 2/5/26 21:01, Daniel Gustafsson wrote:
>> On 3 Feb 2026, at 17:58, Daniel Gustafsson <daniel(at)yesql(dot)se> wrote:
>
>> I've implemented this in the attached patch and it seems very promising.
>
> Turns out I was a bit quick on the send button, and having been focused on the
> failing test_checksums suite I missed running the other suites. (I also
> managed to off-by-one-year name it 202503..) Thanks to some quick off-list
> feedback, here is an updated version which fixes the failing tests and has
> quite a number of smaller touchups.
>
> Like the previous version, I am not sure I have the CHECKPOINT_xxx logic right
> and would love some eyes on it.
>
Thanks for the updated patch. Indeed, I see occasional failures in the
006 test (I suppose 007 would fail too, but I haven't tried).
It's been ages since I looked at this patch, and I managed to forget
most of it. And I'm still learning how this new approach works, so I
don't have a good idea of the root cause yet. But let me describe
what I see and ask some questions. Maybe that'll help us figure out
what might be wrong.
Here's what the failing run did:
[22:11:00.610](0.364s) ok 2 - ensure checksums are set to off
[22:11:00.616](0.006s) # LSN before enabling: 0/4F0BCD78
[22:11:00.629](0.013s) ok 3 - ensure data checksums are transitioned to
inprogress-on
[22:11:09.594](8.965s) ok 4 - ensure data checksums are transitioned to on
[22:11:09.600](0.005s) # LSN after enabling: 0/B1CBD290
[22:11:09.675](0.075s) ok 5 - ensure data pages can be read back on primary
### Stopping node "main" using mode fast
# Running: pg_ctl --pgdata
/home/user/work/postgres/src/test/modules/test_checksums/tmp_check/t_006_pgbench_single_main_data/pgdata
--mode fast stop
waiting for server to shut down.... done
Latest checkpoint location: 0/B28CA0D8
Latest checkpoint's REDO location: 0/B28CA0D8
[22:11:11.517](0.486s) ok 8 - ensure checksums are set to on
[22:11:11.525](0.008s) # LSN before disabling: 0/B3B23C88
[22:11:11.539](0.013s) ok 9 - ensure data checksums are transitioned to off
[22:11:11.545](0.006s) # LSN after disabling: 0/B3B3F430
[22:11:13.596](2.052s) ok 10 - ensure data pages can be read back on primary
### Stopping node "main" using mode immediate
# Running: pg_ctl --pgdata
/home/user/work/postgres/src/test/modules/test_checksums/tmp_check/t_006_pgbench_single_main_data/pgdata
--mode immediate stop
waiting for server to shut down.... done
Latest checkpoint location: 0/B28CA158
Latest checkpoint's REDO location: 0/B28CA158
### Starting node "main"
### Stopping node "main" using mode fast
[22:11:14.130](0.428s) not ok 12 - no checksum validation errors in
primary log (during WAL recovery)
[22:11:14.130](0.000s)
[22:11:14.130](0.000s) # Failed test 'no checksum validation errors in
primary log (during WAL recovery)'
# at t/006_pgbench_single.pl line 180.
So it:
1) starts with checksums=off
2) flips them on by LSN 0/B1CBD290
3) restarts the instance with 'fast' mode
4) flips checksums to 'off'
5) restarts the instance with 'immediate' mode (=> crash)
6) finds checksum failures in the server log (after redo)
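For anyone poking at this outside the TAP test, the state transitions
should be visible with something like the commands below (assuming the
patch keeps reporting the current state through the data_checksums GUC
and pg_controldata - the paths are obviously placeholders):

    # state as seen by a running backend (with the patch this should
    # presumably also report the inprogress-on / inprogress-off states)
    psql -c "SHOW data_checksums"

    # state recorded in the control file, after a clean shutdown
    pg_ctl --pgdata /path/to/pgdata --mode fast stop
    pg_controldata /path/to/pgdata | grep -i checksum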
The failures look like this (all seem to be on VM pages, and the VM
page LSNs are somewhat interesting too - I'll get to that shortly):
2026-02-05 22:11:13.724 CET startup[286330] LOG: redo starts at
0/B28CA1D8
2026-02-05 22:11:13.729 CET startup[286330] LOG: page verification
failed, calculated checksum 17285 but expected 50010 (page LSN 0/A1CF17F0)
2026-02-05 22:11:13.729 CET startup[286330] CONTEXT: WAL redo at
0/B295E930 for Heap/LOCK: ...
2026-02-05 22:11:13.729 CET startup[286330] LOG: invalid page in
block 4 of relation "base/5/16409_vm"; zeroing out page
2026-02-05 22:11:13.732 CET startup[286330] LOG: page verification
failed, calculated checksum 49687 but expected 51187 (page LSN 0/B29F5A20)
2026-02-05 22:11:13.732 CET startup[286330] CONTEXT: WAL redo at
0/B299E5F8 for Heap/UPDATE: ...
2026-02-05 22:11:13.732 CET startup[286330] LOG: invalid page in
block 42 of relation "base/5/16409_fsm"; zeroing out page
...
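To see which blocks the failing redo record actually references, it can
be dumped with something like this (path adjusted to the test's pg_wal
directory):

    pg_waldump --path /path/to/pgdata/pg_wal \
        --start 0/B295E930 --limit 1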
During the test, the cluster went through these checkpoints / checksums:
lsn: 0/4E00E600, desc: CHECKPOINT_REDO wal_level replica; checksums off
lsn: 0/4F0C8B10, prev 0/4F0C6DD8, desc: CHECKSUMS inprogress-on
lsn: 0/52010FA8, desc: CHECKPOINT_REDO wal_level replica; checksums inprogress-on
lsn: 0/56015400, desc: CHECKPOINT_REDO wal_level replica; checksums inprogress-on
...
lsn: 0/AA005E00, desc: CHECKPOINT_REDO wal_level replica; checksums inprogress-on
lsn: 0/AE004988, desc: CHECKPOINT_REDO wal_level replica; checksums inprogress-on
lsn: 0/B1C7DFC8, prev 0/B1C7DFA0, desc: CHECKSUMS on
lsn: 0/B200BE38, desc: CHECKPOINT_REDO wal_level replica; checksums on
lsn: 0/B28CA0D8, desc: CHECKPOINT_SHUTDOWN redo 0/B28CA0D8; ... checksums on; shutdown
lsn: 0/B28CA158, desc: CHECKPOINT_SHUTDOWN redo 0/B28CA158; ... checksums on; shutdown
lsn: 0/B3B2F780, prev 0/B3B2F730, desc: CHECKSUMS inprogress-off
lsn: 0/B3B2F858, prev 0/B3B2F810, desc: CHECKSUMS off
... crash / immediate restart ... (after 0/B3B3F430)
lsn: 0/B47DED08, desc: CHECKPOINT_SHUTDOWN redo 0/B47DED08; ... checksums off; shutdown
lsn: 0/B5A7A600, prev 0/B5A788C8, desc: CHECKSUMS inprogress-on
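For reference, a listing like that can be produced by filtering
pg_waldump output for the interesting record types, e.g. (assuming the
early segments haven't been recycled yet):

    pg_waldump --path /path/to/pgdata/pg_wal --start 0/4E00E600 \
        | grep -E 'CHECKPOINT|CHECKSUMS'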
The redo starts right after the checkpoint at 0/B28CA158. The failure
happens when trying to apply an FPW record to a block in 1663/5/16409,
but the failure is actually in the associated VM, where block 4 has the
wrong checksum. How come?
Grepping the WAL, it seems the last time that VM block was changed is at
LSN 0/A1CF17F0 (which does match the failure log message):
lsn: 0/A1CEF7A0, ... desc: FPI , blkref #0: rel 1663/5/16409 fork vm blk 4 FPW
And AFAICS there are no other WAL records touching this particular vm
page (not even the XLOG_HEAP2_VISIBLE records).
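The filtering options in pg_waldump make checking this fairly easy - I
believe something like this lists every record touching that one VM
block:

    pg_waldump --path /path/to/pgdata/pg_wal --start 0/4E00E600 \
        --relation 1663/5/16409 --fork vm --block 4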
Judging by the CHECKSUMS messages, the LSN 0/A1CEF7A0 is after the
cluster switched to the inprogress-on state, so at that point the
checksum should have been correct, right?
So how come the checksum is not correct when applying the heap page FPW
at LSN 0/B295E930? Either it got broken after enabling checksums in some
way, or the redo got confused about what the "current" checksum state
is.
It's a bit unfortunate the test only detects invalid checksums when the
failures happen to show up in the server log. I wonder if we could run
pg_checksums after the cluster gets shut down in 'fast' mode (we can't
do that with immediate shutdowns, of course). Didn't we do that in the
past, actually? I suppose there was a reason why we removed it, but
maybe we could still run that at least in some cases?
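Something like this is what I mean - pg_checksums needs a cleanly shut
down cluster, so it could only go right after the 'fast' stop, and only
while the cluster reports checksums as "on" at that point:

    pg_ctl --pgdata /path/to/pgdata --mode fast stop
    pg_checksums --check --pgdata /path/to/pgdata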
Anyway, let's assume the checksum was correct after enabling checksums.
That means it had to be broken by the redo somehow - but why would that
be? The XLOG_CHECKSUMS records say we switched to "inprogress-off" at
0/B3B2F780 and to "off" at 0/B3B2F858. Only then do we stop updating the
checksums, right?
We "crash" shortly after LSN 0/B3B3F430 (because that's where the
immediate restart is, roughly), and there's a read-write pgbench running
in the background during all this. It might have modified the VM, but
why wouldn't that set the page LSN?
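The same pg_waldump filtering as before, restricted to the window
between the "off" record and the crash, should show whether anything
WAL-logged touched that VM block in the meantime (it obviously won't
show modifications that are not WAL-logged):

    pg_waldump --path /path/to/pgdata/pg_wal \
        --start 0/B3B2F858 --end 0/B3B3F430 \
        --relation 1663/5/16409 --fork vm --block 4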
The second checksum failure is a bit more interesting - the VM page has
LSN 0/B29F5A20, which is *after* the redo location (and that checkpoint
has set "checksums on"). And the inprogress-off does not appear until
LSN 0/B3B2F780. So how come the VM page has an invalid checksum?
I've only ever seen checksum failures on VM pages, never on any other
object. I have a feeling we're missing some important detail about how
visibility maps work ...
regards
--
Tomas Vondra