Re: new heapcheck contrib module

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: new heapcheck contrib module
Date: 2020-05-11 17:21:22
Message-ID: 5F743835-3399-419C-8324-2D424237E999@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Here is v5 of the patch. Major changes in this version include:

1) A new module, pg_amcheck, which includes a command line client for checking a database or subset of a database. Internally it functions by querying the database for a list of tables which are appropriate given the command line switches, and then calls amcheck's functions to validate each table and/or index. The options for selecting/excluding tables and schemas is patterned on pg_dump, on the assumption that interface is already familiar to users.

2) amcheck's btree checking functions have been refactored to be able to operate in two modes; the original mode in which all errors are reported via ereport, and a new mode for returning errors as rows from a set returning function. The new mode is used by a new function verify_btreeam(), analogous to verify_heapam(), both of which are used by the pg_amcheck command line tool.

3) The regression test which generates corruption within a table uses the pageinspect module to determine the location of each tuple on disk for corrupting. This was suggested upthread.

Testing on the command line shows that the pre-existing btree checking code could use some hardening, as it currently crashes the backend on certain corruptions. When I corrupt relation files for tables and indexes in the backend and then use pg_amcheck to check all objects in the database, I keep getting assertions from the btree checking code. I think I need to harden this code, but wanted to post an updated patch and solicit opinions before doing so. Here are some example problems I'm seeing. Note the stack trace when calling from the command line tool includes the new verify_btreeam function, but you can get the same crashes using the old interface via psql:

From psql, first error:

test=# select bt_index_parent_check('corrupted_idx', true, true);
TRAP: FailedAssertion("_bt_check_natts(rel, key->heapkeyspace, page, offnum)", File: "nbtsearch.c", Line: 663)
0 postgres 0x0000000106872977 ExceptionalCondition + 103
1 postgres 0x00000001063a33e2 _bt_compare + 1090
2 amcheck.so 0x0000000106d62921 bt_target_page_check + 6033
3 amcheck.so 0x0000000106d5fd2f bt_index_check_internal + 2847
4 amcheck.so 0x0000000106d60433 bt_index_parent_check + 67
5 postgres 0x00000001064d6762 ExecInterpExpr + 1634
6 postgres 0x000000010650d071 ExecResult + 321
7 postgres 0x00000001064ddc3d standard_ExecutorRun + 301
8 postgres 0x00000001066600c5 PortalRunSelect + 389
9 postgres 0x000000010665fc7f PortalRun + 527
10 postgres 0x000000010665ed59 exec_simple_query + 1641
11 postgres 0x000000010665c99d PostgresMain + 3661
12 postgres 0x00000001065d6a8a BackendRun + 410
13 postgres 0x00000001065d61c4 ServerLoop + 3044
14 postgres 0x00000001065d2fe9 PostmasterMain + 3769
15 postgres 0x000000010652e3b0 help + 0
16 libdyld.dylib 0x00007fff6725fcc9 start + 1
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: 2020-05-11 10:11:47.394 PDT [41091] LOG: server process (PID 41309) was terminated by signal 6: Abort trap: 6

From commandline, second error:

pgtest % pg_amcheck -i test
(relname=corrupted,blkno=0,offnum=16,lp_off=7680,lp_flags=1,lp_len=31,attnum=,chunk=)
tuple xmin = 3289393 is in the future
(relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=)
tuple xmax = 0 precedes relation relminmxid = 1
(relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=)
tuple xmin = 12593 is in the future
(relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=)

<snip>

(relname=corrupted,blkno=107,offnum=20,lp_off=7392,lp_flags=1,lp_len=34,attnum=,chunk=)
tuple xmin = 306 precedes relation relfrozenxid = 487
(relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
tuple xmax = 0 precedes relation relminmxid = 1
(relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
tuple xmin = 305 precedes relation relfrozenxid = 487
(relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
t_hoff > lp_len (54 > 34)
(relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
t_hoff not max-aligned (54)
TRAP: FailedAssertion("TransactionIdIsValid(xmax)", File: "heapam_visibility.c", Line: 1319)
0 postgres 0x0000000105b22977 ExceptionalCondition + 103
1 postgres 0x0000000105636e86 HeapTupleSatisfiesVacuum + 1158
2 postgres 0x0000000105634aa1 heapam_index_build_range_scan + 1089
3 amcheck.so 0x00000001060100f3 bt_index_check_internal + 3811
4 amcheck.so 0x000000010601057c verify_btreeam + 316
5 postgres 0x0000000105796266 ExecMakeTableFunctionResult + 422
6 postgres 0x00000001057a8c35 FunctionNext + 101
7 postgres 0x00000001057bbf3e ExecNestLoop + 478
8 postgres 0x000000010578dc3d standard_ExecutorRun + 301
9 postgres 0x00000001059100c5 PortalRunSelect + 389
10 postgres 0x000000010590fc7f PortalRun + 527
11 postgres 0x000000010590ed59 exec_simple_query + 1641
12 postgres 0x000000010590c99d PostgresMain + 3661
13 postgres 0x0000000105886a8a BackendRun + 410
14 postgres 0x00000001058861c4 ServerLoop + 3044
15 postgres 0x0000000105882fe9 PostmasterMain + 3769
16 postgres 0x00000001057de3b0 help + 0
17 libdyld.dylib 0x00007fff6725fcc9 start + 1
pg_amcheck: error: query failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

Attachment Content-Type Size
v5-0001-Adding-verify_heapam-and-pg_amcheck.patch application/octet-stream 157.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2020-05-11 17:41:00 Re: PG 13 release notes, first draft
Previous Message Alvaro Herrera 2020-05-11 16:25:54 Re: 2pc leaks fds