Possible Visibility Map corruption in supported branches?

From: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Possible Visibility Map corruption in supported branches?
Date: 2026-06-26 10:19:55
Message-ID: 20260626121955.1431eaca@karst
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi there,

We've been working on what appeared to be a corrupted Visibility Map with one of
our customer.

First part of our mission was to fix the production cluster. We are now
entering the investigation part to find out how this has been triggered.

The cluster is running the unsupported 13.23 without checksum enabled, but I
suspect they would not have helped anyway.

I'm reaching pgsql-hackers because last minor version is from November 2025 and
unless I'm wrong, no bug related to VM has been fixed in current supported
releases since then, so they might be subject to the same bug as well (if this
is a bug). The v19 might not be subject to this though, as Melanie Plageman
refactored some code around VM there.

1. How customer discovered the corruption

Customer is doing a daily vacuum freeze analyze at 2AM using:

vacuumdb -a -v -z -F --jobs=32

The 4th of June, he found in vacuumdb collected logfile the following error:

INFO: aggressively vacuuming "<TABLE_NAME>"

INFO: scanned index "<TABLE_NAME>_key" to remove 524506 row versions
DETAIL: CPU: user: 2.35 s, system: 0.88 s, elapsed: 6.14 s

INFO: scanned index "<TABLE_NAME>_unique" to remove 524506 row versions
DETAIL: CPU: user: 6.12 s, system: 2.87 s, elapsed: 122.03 s

INFO: "<TABLE_NAME>": removed 524506 row versions in 101141 pages
DETAIL: CPU: user: 0.87 s, system: 4.16 s, elapsed: 221.80 s

INFO: index "<TABLE_NAME>_key" now contains 80356747 row versions in 257645
pages
DETAIL: 524506 index row versions were removed.
35771 index pages have been deleted, 34390 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.

INFO: index "<TABLE_NAME>_unique" now contains 80288293 row versions in
533979 pages
DETAIL: 3899 index row versions were removed.
9553 index pages have been deleted, 9553 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.

INFO: "<TABLE_NAME>": found 524506 removable, 10300212 nonremovable row
versions in 489069 out of 3919859 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 3706326546
There were 22523 unused item identifiers.
Skipped 0 pages due to buffer pins, 3430790 frozen pages. 0 pages are
entirely empty.
CPU: user: 6.26 s, system: 13.74 s, elapsed: 1149.83 s.

INFO: analyzing "<TABLE_NAME>"
vacuumdb: error: processing of database "<DBNAME>" failed:
ERROR: could not access status of transaction 3561846225
DETAIL: Could not open file "pg_xact/0D44": No such file or directory.

2. Details and Production fix

The minimal datfrozenxid in the cluster was 3596946949, pointing to the 0D66
xact file, so it seemed accurate that 0D44 was already removed.

Looking at the oldest vacuumdb log file kept from the 17th of May, the oldest
xmin found in the broken table was 3591833207, so we were inside 0D61 by this
time. The 4th of June, it was 3706326546, inside 0DCE. A rough computation
gives us around 6 xacts files per day. As we were in 0D61 the 17th of May, we
were around 0D44 the 13th of May.

Because some rows were older than the reported relfrozenxid on the table, we
thought a VM corruption could keep the vacuum process away from these blocks
and indeed, "pg_check_frozen()" found 68454 broken rows. All theses rows
came from the blocks 3691935 to 3694483 (about 20MB).

We extracted the related blocks using "dd" to studied them with "pg_filedump".
They ONLY contain the 68454 corrupted rows. No other rows. Also, checking rows
headers showed:

* each row was inserted in its own transaction (68454 xmin for 68454 rows)
* they all have an xmin inside 0D44
* checking inside 0D44 showed ALL these xact were ABORTED. None of them has been
COMMITTED.
* 58812 rows are flagged
HEAP_HASNULL, HEAP_HASVARWIDTH, HEAP_XMAX_INVALID
* 9642 rows are flagged
HEAP_HASNULL, HEAP_HASVARWIDTH,HEAP_XMIN_INVALID, HEAP_XMAX_INVALID

Looking at a creation field inside the customer data, we found they were created
between 2026-05-13 19:02:27 and 2026-05-13 19:06:53, which fall inside our rough
computation earlier. The customer explained to us that this table is new, has
been filled with 79 millions rows the 21st of April, then a batch import some
more lines on a daily basis.

As the blocks only contained the corrupted rows and nothing else, and as all
these rows are ABORTED, we tested the following procedure offline, then
customer applied it in production:

* restore "pg_xact/0D44"
* truncate the related VM
* vacuum the table
* pg_check_visible() and pg_check_frozen() the whole database (nothing found)
* bt_index_parent_check() the related index (nothing found)

Now, we can start investigating how this happen.

3. Investigation

We suspect a bug, not a physical corruption, as the blocks content looks
sensible: the data are OK, the XMINs seems accurate with the data and how fast
they increment in the cluster.

Also, rows having HEAP_XMIN_INVALID are found in only 359 blocks among the
2547 corrupted blocks. So it doesn't seems like the action that did set this
flag is guilty to some race condition or bug around the visibility map. Not
alone at least.

It looks like the only trouble comes from the VM, pointing these blocks as
visible/frozen where they actually have dead non-cleaned up rows. The scenario
we could imagine are:

1. vacuum did not clean the blocks content in the heap (as all rows are
ABORTED) before setting the VM, or its writes has been lost, but this seems
really not probable
2. the visibility/frozen bits was wrongly set by some means
3. the visibility/frozen bits was already set and blocks were clean before the
13th of May's batch but has not been unset during the writes, maybe related
to the rollbacks?

The only physical corruption I could imagine would be that the blocks in heap
or VM were somehow reverted to their previous state **after** the vacuum, but:

* the FS is ext4
* sysadmins has been investigating the host way back to early May and it seems
they hadn't find anything wrong
* no PITR restoration
* last switchover in March
* no crash before the error appeared in vacuumdb's logs the 4th of June
* last but not least: the corruption has been replicated to the replicas
according to customer

Unfortunately, we have neither logs from autovacuum or daily vacuumdb from the
13th of May any more, so we can not check its activity during/after this bad
batch.

I'm planing to parse WAL files archived around the 2026-05-13 to look for the
batch writes and the related Heap2/VISIBLE records to the corrupted blocks. If
I can find the writes but not the Heap2/VISIBLE records, I suppose the 3rd
scenario would gain some weight. However, archives are not available yet as
they were removed from the PITR repository and the customer must extract them
from a (very) cold backup first.

We are currently looking for any tips or other paths to investigate. The
customer keeps a copy of a PITR backup to work on it.

Regards,

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message shveta malik 2026-06-26 10:30:32 Re: Proposal: Conflict log history table for Logical Replication
Previous Message Amit Kapila 2026-06-26 10:09:54 Re: Proposal: Conflict log history table for Logical Replication