From: | Tomas Vondra <tomas(at)vondra(dot)me> |
---|---|
To: | Daniel Gustafsson <daniel(at)yesql(dot)se>, Bernd Helmle <mailings(at)oopsware(dot)de> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Changing the state of data checksums in a running cluster |
Date: | 2025-08-20 14:37:33 |
Message-ID: | 830e4296-dbb7-4b5c-be51-64732591f6c8@vondra.me |
Lists: | pgsql-hackers |
On 8/16/25 21:34, Daniel Gustafsson wrote:
> Attached is a rebase on top of the func.sgml changes which caused this to no
> longer apply.
>
> This version is also substantially updated with a new injection point based
> test suite, fixed a few bugs (found by said test suite), added checkpoint to
> disabling checksums, code cleanup, more granular wait events, comment rewrites
> and additions and more smaller cleanups.
>
Thanks for the updated patch.
The injection points seem like a huge improvement, allowing testing of
different code paths in a more deterministic way.
I started running the stress test, using pretty much exactly the version
posted in March [1]. So far I've noticed only one issue, where the
standby reports mismatched checksums on an FSM page:
LOG: page verification failed, calculated checksum 24786 but expected 24760
CONTEXT: WAL redo at 0/0344A290 for Heap2/MULTI_INSERT+INIT: ntuples: 185, flags: 0x28; blkref #0: rel 1663/16384/16403, blk 0
LOG: invalid page in block 2 of relation base/16384/16403_fsm; zeroing out page
CONTEXT: WAL redo at 0/0344A290 for Heap2/MULTI_INSERT+INIT: ntuples: 185, flags: 0x28; blkref #0: rel 1663/16384/16403, blk 0
WARNING: invalid page in block 2 of relation base/16384/16403_fsm; zeroing out page
CONTEXT: WAL redo at 0/0344A290 for Heap2/MULTI_INSERT+INIT: ntuples: 185, flags: 0x28; blkref #0: rel 1663/16384/16403, blk 0
LOG: page verification failed, calculated checksum 37048 but expected 0
CONTEXT: WAL redo at 0/0344D7E0 for Heap2/MULTI_INSERT+INIT: ntuples: 61, flags: 0x28; blkref #0: rel 1663/16384/16400, blk 0
LOG: invalid page in block 2 of relation base/16384/16400_fsm; zeroing out page
This happens quite regularly; it's not hard to hit. But I've only seen
it happen on an FSM, and only right after an immediate shutdown. I don't
think that's expected.
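To check the "only on an FSM" observation across many runs, the standby log can be scanned for the "invalid page" messages and grouped by relation fork. The following is only a minimal sketch: the helper name `invalid_pages_by_fork` is hypothetical, and it assumes the message format shown above, where the fork appears as a `_fsm`, `_vm` or `_init` suffix on the relation path and a missing suffix means the main fork.

```python
import re
from collections import Counter

# Message format taken from the log excerpt above, e.g.
#   "invalid page in block 2 of relation base/16384/16403_fsm; zeroing out page"
INVALID_PAGE = re.compile(r"invalid page in block (\d+) of relation ([^\s;]+)")
# Assumption: non-main forks carry one of these suffixes on the path.
FORK_SUFFIX = re.compile(r"_(fsm|vm|init)$")


def invalid_pages_by_fork(log_text):
    """Count distinct invalid pages per fork (hypothetical helper).

    The same page can be reported twice (once as LOG, once as WARNING),
    so pages are de-duplicated on (relation path, block number).
    """
    seen = set()
    for match in INVALID_PAGE.finditer(log_text):
        block = int(match.group(1))
        path = match.group(2)
        suffix = FORK_SUFFIX.search(path)
        fork = suffix.group(1) if suffix else "main"
        seen.add((path, block, fork))
    return Counter(fork for _, _, fork in seen)


SAMPLE = """\
LOG: invalid page in block 2 of relation base/16384/16403_fsm; zeroing out page
WARNING: invalid page in block 2 of relation base/16384/16403_fsm; zeroing out page
LOG: invalid page in block 2 of relation base/16384/16400_fsm; zeroing out page
"""

if __name__ == "__main__":
    print(invalid_pages_by_fork(SAMPLE))  # Counter({'fsm': 2})
```

Any non-zero count for "main" (or "vm") would contradict the FSM-only pattern and point at a wider problem.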
I believe the built-in TAP tests (with injection points) can't catch
this, because there's no concurrent activity while flipping checksums
on/off. It'd be good to add that, e.g. by running pgbench in the
background.
I also don't see any restarts of the primary/standby in those tests.
That might be good to add too.
I plan to randomize the stress test a bit more, once this FSM issue gets
fixed. Maybe that'll find some additional issues.
[1]
https://www.postgresql.org/message-id/f528413c-477a-4ec3-a0df-e22a80ffbe41@vondra.me
--
Tomas Vondra