Re: amcheck: add index-all-keys-match verification for B-Tree

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc: Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Wenbo Lin <linwenbo1994(at)gmail(dot)com>
Subject: Re: amcheck: add index-all-keys-match verification for B-Tree
Date: 2026-04-08 13:17:45
Message-ID: CAKZiRmwCT=pqesAjC4-2rRLWQ2uiBkrmJBhWeTgALn6yfqbuew@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 5, 2026 at 7:00 AM Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
> Thanks for the review! I've addressed all these issues in patch steps 4 and 5. Other steps are intact. PFA.

Hi Andrey and Wenbo,

1. I tried v4 to see how it would cope with missing heap segments[0]
(built with -Dsegsize_blocks=6), so given:

create table ptest(id bigint, tenant_id bigint);
insert into ptest select g, mod(g,10) from generate_series(1, 3000) g;
alter table ptest add primary key (id);
checkpoint;

postgres=# select count(*) from ptest;
count
-------
3000

-- simulate loss of middle segment:
$ mv base/5/24578.1 base/5/24578.1.ORG
$ ls -1 base/5/24578*
base/5/24578
base/5/24578.1.ORG
base/5/24578.2
base/5/24578_fsm
base/5/24578_vm

-- force _mfd_openseg() via new backend to show silent data loss:
postgres=# select count(*) from ptest;
count
-------
1110

-- try new amcheck
postgres=# select bt_index_check(index=>'ptest_pkey',
heapallindexed=>true, checkunique=>true, indexallkeysmatch=>false);
bt_index_check
----------------

-- try the new indexallkeysmatch
-- might need a restart before it is able to properly bail out
postgres=# select bt_index_check(index=>'ptest_pkey',
heapallindexed=>true, checkunique=>true, indexallkeysmatch=>true);
ERROR: could not open file "base/5/24578.1" (target block 6): No such
file or directory

this error was not quite what I expected (although it is much better than
nothing, as on master today), because patch v4-0002 assumes in
bt_verify_index_tuple_points_to_heap() that we should get false and report
the error in a more human-friendly way ("index tuple points to non-existent
heap"). The error itself comes out of the following stacktrace:

_mdfd_getseg (forknum=MAIN_FORKNUM, blkno=6)
<- mdstartreadv (..) <- smgrstartreadv (blocknum=blocknum(at)entry=6 (!))
<- StartReadBuffersImpl (blockNum=6 (!)) <- StartReadBuffer()
<- ReadBuffer_common() <- ReadBufferExtended (blockNum=6 (!))
<- heap_fetch() <- heapam_fetch_row_version()
<- table_tuple_fetch_row_version()
<- bt_verify_index_tuple_points_to_heap(targetblock=5, offset=14)
<- bt_target_page_check() <- bt_check_level_from_leftmost()
<- bt_check_every_level() <- bt_index_check_callback()
<- amcheck_lock_relation_and_check() <- bt_index_check()

and it's because the itup in bt_target_page_check() has bi_lo = 6, ip_posid = 1.

So with the attached patch it becomes more human-readable, but it is a
somewhat orthogonal check? Still, maybe we can combine this under this
$thread umbrella and make it an option of indexallkeysmatch too? The
attached behaves like this:

postgres=# select bt_index_check(index=>'ptest_pkey');
ERROR: index line pointer in index "ptest_pkey" points to missing
page in table "ptest"
DETAIL: Index tid=(5,14) points to heap tid=(6,1) but heap has only 6 blocks.
HINT: this can be caused by lost relation segment (missing or removed file).

(note it does not need indexallkeysmatch, but it's related)

Also, the attached patch currently blows up the pg_amcheck test (not
contrib/amcheck), where we ask to ignore certain corrupted schemas/tables to
test the exclusion logic, but we still verify btrees, and with the attached
patch this btree verification ends up discovering the short heap segment :D
I honestly do not know what to do in this situation (we ask to ignore one
table T, but still launch bt_index_check() on its index and discover an
illegal situation about the state of T).

Also, I'm not sure whether rechecking heapnblocks is the right way; it's
just an idea.

2. For some reason Wenbo's email didn't reach my mailbox, but I can see
it here:
https://www.postgresql.org/message-id/resend/ba92ac77-24f1-44ad-abf0-11e64e0a7831%40gmail.com
(and resending is somewhat ineffective)

> I've been reviewing the patch and ran some concurrent tests against it.
> I found an issue with false positive corruption reports under concurrent
> VACUUM.
>
> + slot = table_slot_create(state->heaprel, NULL);
> + found = table_tuple_fetch_row_version(state->heaprel, tid,
> + SnapshotAny, slot);
> + if (!found)
> + {
> + ExecDropSingleTupleTableSlot(slot);
> + ereport(ERROR,
>
>
> bt_verify_index_tuple_points_to_heap uses SnapshotAny with
> table_tuple_fetch_row_version to distinguish "tuple doesn't exist" from
> "tuple exists but is dead". However, bt_index_check only holds
> AccessShareLock, which is compatible with VACUUM's ShareUpdateExclusiveLock.
> A concurrent VACUUM phase 1 can prune heap pages (ItemIdSetDead) between
> the Bloom filter probe and the heap lookup, causing
> table_tuple_fetch_row_version to return false for a legal
> dead-but-now-pruned tuple and a false positive corruption report.
>
> The table_tuple_satisfies_snapshot check does not help; it only runs
> when the fetch succeeds (LP_NORMAL). Once VACUUM sets LP_DEAD,
> heapam_fetch sees !ItemIdIsNormal(lp) and returns false before any
> snapshot checks.
>
> The reproducer should be: DELETE all rows from a big table, then launch
> VACUUM and bt_index_check concurrently. A POC test script is attached.

I think this is spot-on. I haven't run the repro, but if we take ASL and
VACUUM is allowed to run (under ShareUpdateExclusiveLock), then amcheck
performs the heap Bloom filter building (ignoring dead tuples, while the
btree can still point to those), but VACUUM could later remove the heap
tuple. It seems impossible (under ASL) to differentiate an orphaned index
tuple from a concurrently vacuumed dead tuple?

Maybe we should grab ShareUpdateExclusiveLock in bt_index_check for
indexallkeysmatch==true, with some new documentation warning about this?
(to block just VACUUM, but not DML)

An alternative is to do it via bt_index_parent_check()/ShareLock, but that
is hardly acceptable in many situations.

3. This is more a question than a finding: assuming extremely big tables,
wouldn't we benefit from some form of caching of the EState in
bt_verify_index_tuple_points_to_heap()? Or do we not care about such stuff
in amcheck? (we seem to re-create the EState with each call to
FormIndexDatum())

-J.

[0] - https://www.postgresql.org/message-id/flat/013D63E2-5D75-492E-85FF-1D5CC0148C82%40gmail.com

Attachment Content-Type Size
nocfbot_v4b-0001-amcheck-attempt-to-detect-lost-heap-segments-bas.patch text/x-patch 8.1 KB
