| From: | Fabrízio de Royes Mello <fabriziomello(at)gmail(dot)com> |
|---|---|
| To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: [PATCH] pg_surgery: Fix WAL corruption from concurrent heap_force_kill |
| Date: | 2026-05-05 15:11:54 |
| Message-ID: | CAFcNs+paBxgmJMsFH5okMWWmt6KnJVAHCQKbNe+sB=n+JdYZmQ@mail.gmail.com |
| Lists: | pgsql-hackers |
On Mon, Apr 20, 2026 at 5:20 PM Fabrízio de Royes Mello <
fabriziomello(at)gmail(dot)com> wrote:
>
> Hi hackers,
>
> I found a bug in pg_surgery's heap_force_kill that can produce WAL
> records with incorrect CRC checksums when multiple sessions operate on heap
> pages that share the same visibility map page.
>
> When heap_force_kill kills a tuple on an all-visible page, it calls
> visibilitymap_clear() to clear the VM bits, then later calls
> log_newpage_buffer(vmbuf) to write a full-page image of the VM page to WAL.
> However, visibilitymap_clear() releases the exclusive lock on the VM buffer
> before returning -- that's its normal API contract. The subsequent
> log_newpage_buffer(vmbuf) then writes the FPI without holding any content
> lock on the VM buffer.
>
> XLogInsert reads the page content twice through the XLogRecData chain,
> whose data pointers reference the shared buffer directly: once to compute
> the CRC in XLogRecordAssemble, and once to copy the data into the WAL
> buffers in CopyXLogRecordToWAL. If a concurrent backend modifies the VM
> page between those two reads (e.g., another heap_force_kill session
> clearing a different bit on the same VM page), the CRC will not match the
> bytes actually written.
>
> The fix is to re-acquire the exclusive lock on the VM buffer immediately
> after visibilitymap_clear() returns, and to release it only after
> log_newpage_buffer() completes. The lock is taken only when
> RelationNeedsWAL() is true, since unlogged relations skip the FPI write
> entirely.
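For reviewers, the corrected ordering would look roughly like this. This is a sketch in PostgreSQL's internal C idiom based on the description above, not the patch itself (argument lists abbreviated; not compilable standalone):

```c
/* Sketch of the fixed ordering (not the actual patch). */
visibilitymap_clear(rel, blkno, vmbuf, VISIBILITYMAP_VALID_BITS);

if (RelationNeedsWAL(rel))
{
    /*
     * visibilitymap_clear() dropped its exclusive lock on vmbuf, so
     * re-take it before logging the FPI; otherwise XLogInsert() reads
     * the page with no content lock held.
     */
    LockBuffer(vmbuf, BUFFER_LOCK_EXCLUSIVE);
    log_newpage_buffer(vmbuf, false);
    LockBuffer(vmbuf, BUFFER_LOCK_UNLOCK);
}
```

Holding the exclusive lock across log_newpage_buffer() closes the window, since a concurrent session's visibilitymap_clear() must wait for the same lock.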
>
> The patch includes a TAP test that uses injection points to
> deterministically reproduce the race condition. Three injection points are
> used:
> * "heap-force-kill-vm-pin" in heap_surgery.c (outside the critical
> section) to initialize DSM shared memory for the wait machinery.
> * "heap-force-kill-before-vm-wal" in heap_surgery.c (inside the critical
> section, between the heap FPI and VM FPI writes) as a synchronization
> barrier.
> * "wal-insert-after-crc" in xloginsert.c (inside XLogInsert, between
> XLogRecordAssemble and XLogInsertRecord) to pause after CRC computation but
> before the data copy.
>
> The test pauses session 1 inside XLogInsert after the VM FPI's CRC is
> computed, runs a concurrent heap_force_kill from session 2 on the same VM
> page, then wakes session 1. Without the fix, session 2 modifies the VM page
> between CRC and copy, and pg_walinspect reports "incorrect resource manager
> data checksum". With the fix, session 2 blocks on the VM buffer lock and no
> corruption occurs.
>
> The "wal-insert-after-crc" injection point in xloginsert.c uses
> INJECTION_POINT_CACHED (pre-loaded by the caller before the critical
> section), so it only fires when explicitly loaded and adds no overhead to
> the normal WAL write path.
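For readers unfamiliar with the cached variant, the load/fire split looks roughly like this. A sketch only, using the point name from this thread; the macro spellings follow recent PostgreSQL releases and exact signatures may differ by version (not compilable standalone):

```c
/* In heap_surgery.c, before entering the critical section: look up and
 * cache the injection-point callback while allocation is still allowed. */
INJECTION_POINT_LOAD("wal-insert-after-crc");

START_CRIT_SECTION();
/* ... */

/* In xloginsert.c, between XLogRecordAssemble() and XLogInsertRecord():
 * fires only if some caller pre-loaded it; otherwise it is a no-op. */
INJECTION_POINT_CACHED("wal-insert-after-crc");
```

Splitting load from fire is what keeps the hook out of the hot path: backends that never load the point pay only a cached no-op check inside the critical section.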
>
Rebased version.
--
Fabrízio de Royes Mello
| Attachment | Content-Type | Size |
|---|---|---|
| 0001-pg_surgery-Fix-WAL-corruption-from-concurrent-heap_f.patch | application/octet-stream | 14.9 KB |