Re: Changing the state of data checksums in a running cluster

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Andres Freund <andres(at)anarazel(dot)de>, Bernd Helmle <mailings(at)oopsware(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Changing the state of data checksums in a running cluster
Date: 2026-04-06 17:00:00
Message-ID: b6b0637d-3baf-4a4d-a3b7-9b3558a88d40@gmail.com
Lists: pgsql-hackers

Hello Daniel,

04.04.2026 00:46, Daniel Gustafsson wrote:
> After many more runs on CI I ended up pushing this version, and I see BF
> members being angry due the test not waiting for the launcher to exit. I am
> working on a fix right now.

Maybe this is already known or even expected, but I'd still like to let
you know that, starting from f19c0ecca, I'm observing checksum errors in a
running instance. To catch them sooner, I've modified PageIsVerified() to
raise PANIC instead of WARNING/LOG:
@@ -158,7 +158,7 @@ PageIsVerified(PageData *page, BlockNumber blkno, int flags, bool *checksum_fail
     if (checksum_failure)
     {
         if ((flags & (PIV_LOG_WARNING | PIV_LOG_LOG)) != 0)
-            ereport(flags & PIV_LOG_WARNING ? WARNING : LOG,
+            ereport(PANIC,
                     (errcode(ERRCODE_DATA_CORRUPTED),
                      errmsg("page verification failed, calculated checksum %u but expected %u%s",
                             checksum, p->pd_checksum,

And I'm getting, e.g.:
2026-04-06 18:09:12.077 EEST|postgres|regress_215|69d3cc86.3bfbdc|PANIC:  page verification failed, calculated checksum 40178 but expected 50558, buffer will be zeroed
2026-04-06 18:09:12.077 EEST|postgres|regress_215|69d3cc86.3bfbdc|STATEMENT:  update information_schema.sql_features set
...

Core was generated by `postgres: postgres regress_215 127.0.0.1(42448) UPDATE        '.
Program terminated with signal SIGABRT, Aborted.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x0000796d0004527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x0000796d000288ff in __GI_abort () at ./stdlib/abort.c:79
#5  0x000055fe3f92c855 in errfinish (filename=filename@entry=0x55fe3fa54bad "bufpage.c", lineno=lineno@entry=161,
funcname=funcname@entry=0x55fe3fb70ff8 <__func__.6> "PageIsVerified") at elog.c:620
#6  0x000055fe3f7c2415 in PageIsVerified (page=page@entry=0x796cf6884000 "", blkno=blkno@entry=0, flags=10,
checksum_failure_p=checksum_failure_p@entry=0x7ffd2c524bef) at bufpage.c:161
#7  0x000055fe3f78a93d in buffer_readv_complete_one (zeroed_buffer=<synthetic pointer>, ignored_checksum=<synthetic
pointer>, failed_checksum=0x7ffd2c524bef, buffer_invalid=<synthetic pointer>, is_temp=false, failed=false, flags=9 '\t',
buffer=15424, buf_off=0 '\000', td=0x796cfc69c2d8) at bufmgr.c:8593
#8  buffer_readv_complete (is_temp=false, cb_data=<optimized out>, prior_result=..., ioh=<optimized out>) at bufmgr.c:8724
#9  shared_buffer_readv_complete (ioh=<optimized out>, prior_result=..., cb_data=<optimized out>) at bufmgr.c:8883
#10 0x000055fe3f77ec61 in pgaio_io_call_complete_shared (ioh=ioh@entry=0x796cfc69c260) at aio_callback.c:258
#11 0x000055fe3f77d4f6 in pgaio_io_process_completion (ioh=ioh@entry=0x796cfc69c260, result=<optimized out>) at aio.c:540
#12 0x000055fe3f77fe42 in pgaio_io_perform_synchronously (ioh=ioh@entry=0x796cfc69c260) at aio_io.c:146
#13 0x000055fe3f77e121 in pgaio_io_stage (ioh=ioh@entry=0x796cfc69c260, op=op@entry=PGAIO_OP_READV) at aio.c:476
#14 0x000055fe3f77fd6d in pgaio_io_start_readv (ioh=ioh@entry=0x796cfc69c260, fd=166, iovcnt=iovcnt@entry=1,
offset=offset@entry=0) at aio_io.c:87
#15 0x000055fe3f795bae in FileStartReadV (ioh=ioh@entry=0x796cfc69c260, file=<optimized out>, iovcnt=iovcnt@entry=1,
offset=offset@entry=0, wait_event_info=wait_event_info@entry=167772183) at fd.c:2225
#16 0x000055fe3f7c648b in mdstartreadv (ioh=0x796cfc69c260, reln=0x55fe73aeba98, forknum=VISIBILITYMAP_FORKNUM,
blocknum=0, buffers=<optimized out>, nblocks=1) at md.c:1041
#17 0x000055fe3f7c809c in smgrstartreadv (ioh=ioh@entry=0x796cfc69c260, reln=<optimized out>,
forknum=forknum@entry=VISIBILITYMAP_FORKNUM, blocknum=blocknum@entry=0, buffers=buffers@entry=0x7ffd2c524e70,
nblocks=nblocks@entry=1) at smgr.c:758
#18 0x000055fe3f78a1c7 in AsyncReadBuffers (operation=operation@entry=0x7ffd2c5253a0,
nblocks_progress=nblocks_progress@entry=0x7ffd2c52530c) at bufmgr.c:2144
#19 0x000055fe3f78ce19 in StartReadBuffersImpl (allow_forwarding=false, flags=9, nblocks=0x7ffd2c52530c, blockNum=0,
buffers=0x7ffd2c52539c, operation=0x7ffd2c5253a0) at bufmgr.c:1548
#20 StartReadBuffer (operation=operation@entry=0x7ffd2c5253a0, buffer=buffer@entry=0x7ffd2c52539c,
blocknum=blocknum@entry=0, flags=9) at bufmgr.c:1636
#21 0x000055fe3f78d870 in ReadBuffer_common (strategy=0x0, mode=RBM_ZERO_ON_ERROR, blockNum=0,
forkNum=VISIBILITYMAP_FORKNUM, smgr_persistence=0 '\000', smgr=0x55fe73aeba98, rel=0x796d006a31a8) at bufmgr.c:1358
#22 ReadBufferExtended (reln=reln@entry=0x796d006a31a8, forkNum=forkNum@entry=VISIBILITYMAP_FORKNUM,
blockNum=blockNum@entry=0, mode=mode@entry=RBM_ZERO_ON_ERROR, strategy=strategy@entry=0x0) at bufmgr.c:945
#23 0x000055fe3f3d7e00 in vm_readbuf (rel=rel@entry=0x796d006a31a8, blkno=blkno@entry=0, extend=extend@entry=true) at
visibilitymap.c:577
#24 0x000055fe3f3d7fda in visibilitymap_pin (rel=rel@entry=0x796d006a31a8, heapBlk=<optimized out>,
vmbuf=vmbuf@entry=0x55fe73ed2b18) at visibilitymap.c:216
#25 0x000055fe3f3d1f7a in heap_page_prune_opt (relation=0x796d006a31a8, buffer=buffer@entry=15403,
vmbuffer=vmbuffer@entry=0x55fe73ed2b18, rel_read_only=false) at pruneheap.c:339
#26 0x000055fe3f3c1dcf in heap_prepare_pagescan (sscan=sscan@entry=0x55fe73ed2a88) at heapam.c:636
#27 0x000055fe3f3c242f in heapgettup_pagemode (scan=scan@entry=0x55fe73ed2a88, dir=ForwardScanDirection, nkeys=0,
key=0x0) at heapam.c:1111
#28 0x000055fe3f3c27ab in heap_getnextslot (sscan=0x55fe73ed2a88, direction=<optimized out>, slot=0x55fe73ed13a8) at
heapam.c:1467
#29 0x000055fe3f5e8d62 in table_scan_getnextslot (sscan=<optimized out>, direction=direction@entry=ForwardScanDirection,
slot=slot@entry=0x55fe73ed13a8) at ../../../src/include/access/tableam.h:1099
#30 0x000055fe3f5e939e in SeqNext (node=0x55fe73ed1188) at nodeSeqscan.c:83
#31 ExecScanFetch (recheckMtd=0x55fe3f5e8d2e <SeqRecheck>, accessMtd=0x55fe3f5e8c9c <SeqNext>, epqstate=0x0,
node=0x55fe73ed1188) at ../../../src/include/executor/execScan.h:135
#32 ExecScanExtended (projInfo=0x55fe73ed19d8, qual=0x0, epqstate=0x0, recheckMtd=0x55fe3f5e8d2e <SeqRecheck>,
accessMtd=0x55fe3f5e8c9c <SeqNext>, node=0x55fe73ed1188) at ../../../src/include/executor/execScan.h:196
#33 ExecSeqScanWithProject (pstate=<optimized out>) at nodeSeqscan.c:164
#34 0x000055fe3f5b65f9 in ExecProcNodeFirst (node=0x55fe73ed1188) at execProcnode.c:470
#35 0x000055fe3f5dfd3b in ExecProcNode (node=node@entry=0x55fe73ed1188) at ../../../src/include/executor/executor.h:320
...

2026-04-06 18:09:12.289 EEST|postgres|regress_147|69d3cc44.3bfaaa|PANIC:  page verification failed, calculated checksum 8769 but expected 0
2026-04-06 18:09:12.289 EEST|postgres|regress_147|69d3cc44.3bfaaa|STATEMENT:  insert into
information_schema.sql_features values (
...

Core was generated by `postgres: postgres regress_147 127.0.0.1(35968) INSERT        '.
Program terminated with signal SIGABRT, Aborted.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x0000796d0004527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x0000796d000288ff in __GI_abort () at ./stdlib/abort.c:79
#5  0x000055fe3f92c855 in errfinish (filename=filename@entry=0x55fe3fa54bad "bufpage.c", lineno=lineno@entry=161,
funcname=funcname@entry=0x55fe3fb70ff8 <__func__.6> "PageIsVerified") at elog.c:620
#6  0x000055fe3f7c2415 in PageIsVerified (page=page@entry=0x796cf0c80000 "", blkno=blkno@entry=21, flags=2,
checksum_failure_p=checksum_failure_p@entry=0x7ffd2c526b1f) at bufpage.c:161
#7  0x000055fe3f78a93d in buffer_readv_complete_one (zeroed_buffer=<synthetic pointer>, ignored_checksum=<synthetic
pointer>, failed_checksum=0x7ffd2c526b1f, buffer_invalid=<synthetic pointer>, is_temp=false, failed=false, flags=8 '\b',
buffer=3646, buf_off=0 '\000', td=0x796cfc63e7c8) at bufmgr.c:8593
#8  buffer_readv_complete (is_temp=false, cb_data=<optimized out>, prior_result=..., ioh=<optimized out>) at bufmgr.c:8724
#9  shared_buffer_readv_complete (ioh=<optimized out>, prior_result=..., cb_data=<optimized out>) at bufmgr.c:8883
#10 0x000055fe3f77ec61 in pgaio_io_call_complete_shared (ioh=ioh@entry=0x796cfc63e750) at aio_callback.c:258
#11 0x000055fe3f77d4f6 in pgaio_io_process_completion (ioh=ioh@entry=0x796cfc63e750, result=<optimized out>) at aio.c:540
#12 0x000055fe3f77fe42 in pgaio_io_perform_synchronously (ioh=ioh@entry=0x796cfc63e750) at aio_io.c:146
#13 0x000055fe3f77e121 in pgaio_io_stage (ioh=ioh@entry=0x796cfc63e750, op=op@entry=PGAIO_OP_READV) at aio.c:476
#14 0x000055fe3f77fd6d in pgaio_io_start_readv (ioh=ioh@entry=0x796cfc63e750, fd=199, iovcnt=iovcnt@entry=1,
offset=offset@entry=172032) at aio_io.c:87
#15 0x000055fe3f795bae in FileStartReadV (ioh=ioh@entry=0x796cfc63e750, file=<optimized out>, iovcnt=iovcnt@entry=1,
offset=offset@entry=172032, wait_event_info=wait_event_info@entry=167772183) at fd.c:2225
#16 0x000055fe3f7c648b in mdstartreadv (ioh=0x796cfc63e750, reln=0x55fe73b833b8, forknum=MAIN_FORKNUM, blocknum=21,
buffers=<optimized out>, nblocks=1) at md.c:1041
#17 0x000055fe3f7c809c in smgrstartreadv (ioh=ioh@entry=0x796cfc63e750, reln=<optimized out>,
forknum=forknum@entry=MAIN_FORKNUM, blocknum=blocknum@entry=21, buffers=buffers@entry=0x7ffd2c526da0,
nblocks=nblocks@entry=1) at smgr.c:758
#18 0x000055fe3f78a1c7 in AsyncReadBuffers (operation=operation@entry=0x7ffd2c5272d0,
nblocks_progress=nblocks_progress@entry=0x7ffd2c52723c) at bufmgr.c:2144
#19 0x000055fe3f78ce19 in StartReadBuffersImpl (allow_forwarding=false, flags=8, nblocks=0x7ffd2c52723c, blockNum=21,
buffers=0x7ffd2c5272cc, operation=0x7ffd2c5272d0) at bufmgr.c:1548
#20 StartReadBuffer (operation=operation@entry=0x7ffd2c5272d0, buffer=buffer@entry=0x7ffd2c5272cc,
blocknum=blocknum@entry=21, flags=8) at bufmgr.c:1636
#21 0x000055fe3f78d870 in ReadBuffer_common (strategy=0x0, mode=RBM_NORMAL, blockNum=21, forkNum=MAIN_FORKNUM,
smgr_persistence=0 '\000', smgr=0x55fe73b833b8, rel=0x796d0066aa18) at bufmgr.c:1358
#22 ReadBufferExtended (reln=0x796d0066aa18, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=21,
mode=mode@entry=RBM_NORMAL, strategy=strategy@entry=0x0) at bufmgr.c:945
#23 0x000055fe3f3ce074 in ReadBufferBI (relation=relation@entry=0x796d0066aa18, targetBlock=targetBlock@entry=21,
mode=mode@entry=RBM_NORMAL, bistate=bistate@entry=0x0) at hio.c:93
#24 0x000055fe3f3cea30 in RelationGetBufferForTuple (relation=relation@entry=0x796d0066aa18, len=24,
otherBuffer=otherBuffer@entry=0, options=options@entry=0, bistate=bistate@entry=0x0,
vmbuffer=vmbuffer@entry=0x7ffd2c527468, vmbuffer_other=0x0, num_pages=1) at hio.c:617
#25 0x000055fe3f3bcb50 in heap_insert (relation=relation@entry=0x796d0066aa18, tup=tup@entry=0x55fe73be9638,
cid=cid@entry=0, options=options@entry=0, bistate=bistate@entry=0x0) at heapam.c:2179
#26 0x000055fe3f3c7c82 in heapam_tuple_insert (relation=0x796d0066aa18, slot=0x55fe73be9528, cid=0, options=0,
bistate=0x0) at heapam_handler.c:267
#27 0x000055fe3f5e2da2 in table_tuple_insert (bistate=0x0, options=0, cid=<optimized out>, slot=0x55fe73be9528,
rel=0x796d0066aa18) at ../../../src/include/access/tableam.h:1456
#28 ExecInsert (context=context@entry=0x7ffd2c527620, resultRelInfo=resultRelInfo@entry=0x55fe737c8b00,
slot=0x55fe73be9528, canSetTag=true, inserted_tuple=inserted_tuple@entry=0x0, insert_destrel=insert_destrel@entry=0x0)
at nodeModifyTable.c:1272
#29 0x000055fe3f5e5542 in ExecModifyTable (pstate=0x55fe737c88f0) at nodeModifyTable.c:4933
#30 0x000055fe3f5b65f9 in ExecProcNodeFirst (node=0x55fe737c88f0) at execProcnode.c:470
...

I can reproduce it rather easily (within 30 minutes) with 600 instances of
"sqlsmith --max-queries=1000" running against separate empty databases, on
my workstation with a Ryzen 7900. I think I can compose a self-contained
repro, if needed. If you need more information/diagnostics, I'd be glad
to help.
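
A minimal sketch of the setup described above, not the script actually
used. Assumptions: the databases are named regress_N (as the regress_147
and regress_215 entries in the log suggest), connection settings come from
the usual PGHOST/PGPORT environment variables, and createdb and sqlsmith
are on PATH. The helper names are hypothetical. How the cluster was
initialized, and whether the checksum state was changed during the run, is
not stated here, so the sketch leaves that out.

```python
#!/usr/bin/env python3
"""Sketch: run many sqlsmith instances, each against its own empty database."""
import subprocess
import sys

INSTANCES = 600         # number of concurrent sqlsmith instances (from the report)
MAX_QUERIES = 1000      # value passed to --max-queries (from the report)
DB_PREFIX = "regress_"  # assumption: inferred from the database names in the log


def db_name(i: int) -> str:
    """Name of the i-th database (hypothetical naming scheme)."""
    return f"{DB_PREFIX}{i}"


def createdb_command(i: int) -> list[str]:
    """Command that creates the i-th empty database."""
    return ["createdb", db_name(i)]


def sqlsmith_command(i: int) -> list[str]:
    """Command that runs one sqlsmith instance against the i-th database."""
    return [
        "sqlsmith",
        f"--target=dbname={db_name(i)}",  # libpq connection string
        f"--max-queries={MAX_QUERIES}",
    ]


def run_all(instances: int = INSTANCES) -> int:
    """Create the databases, start all instances, return how many exited non-zero."""
    for i in range(1, instances + 1):
        subprocess.run(createdb_command(i), check=True)
    procs = [
        subprocess.Popen(
            sqlsmith_command(i),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        for i in range(1, instances + 1)
    ]
    return sum(1 for p in procs if p.wait() != 0)


# Only touch a server when explicitly asked to, e.g. "python3 repro.py --run".
if __name__ == "__main__" and sys.argv[1:] == ["--run"]:
    print(f"{run_all()} sqlsmith instance(s) exited with an error")
```

With the PANIC patch applied, a failure would show up as a backend crash
in the server log rather than in the exit status of any one instance, so
the log is the place to look.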

Best regards,
Alexander
