Re: new heapcheck contrib module

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: new heapcheck contrib module
Date: 2020-06-11 16:14:33
Message-ID: CAFiTN-syyHc3jZoou51v0SR8z0POoNfktqEO6MaGig4YS8mosA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, May 11, 2020 at 10:51 PM Mark Dilger
<mark(dot)dilger(at)enterprisedb(dot)com> wrote:
>
> Here is v5 of the patch. Major changes in this version include:
>
> 1) A new module, pg_amcheck, which includes a command line client for checking a database or subset of a database. Internally it functions by querying the database for a list of tables which are appropriate given the command line switches, and then calls amcheck's functions to validate each table and/or index. The options for selecting/excluding tables and schemas is patterned on pg_dump, on the assumption that interface is already familiar to users.
>
> 2) amcheck's btree checking functions have been refactored to be able to operate in two modes; the original mode in which all errors are reported via ereport, and a new mode for returning errors as rows from a set returning function. The new mode is used by a new function verify_btreeam(), analogous to verify_heapam(), both of which are used by the pg_amcheck command line tool.
>
> 3) The regression test which generates corruption within a table uses the pageinspect module to determine the location of each tuple on disk for corrupting. This was suggested upthread.
>
> Testing on the command line shows that the pre-existing btree checking code could use some hardening, as it currently crashes the backend on certain corruptions. When I corrupt relation files for tables and indexes in the backend and then use pg_amcheck to check all objects in the database, I keep getting assertions from the btree checking code. I think I need to harden this code, but wanted to post an updated patch and solicit opinions before doing so. Here are some example problems I'm seeing. Note the stack trace when calling from the command line tool includes the new verify_btreeam function, but you can get the same crashes using the old interface via psql:
>
> From psql, first error:
>
> test=# select bt_index_parent_check('corrupted_idx', true, true);
> TRAP: FailedAssertion("_bt_check_natts(rel, key->heapkeyspace, page, offnum)", File: "nbtsearch.c", Line: 663)
> 0 postgres 0x0000000106872977 ExceptionalCondition + 103
> 1 postgres 0x00000001063a33e2 _bt_compare + 1090
> 2 amcheck.so 0x0000000106d62921 bt_target_page_check + 6033
> 3 amcheck.so 0x0000000106d5fd2f bt_index_check_internal + 2847
> 4 amcheck.so 0x0000000106d60433 bt_index_parent_check + 67
> 5 postgres 0x00000001064d6762 ExecInterpExpr + 1634
> 6 postgres 0x000000010650d071 ExecResult + 321
> 7 postgres 0x00000001064ddc3d standard_ExecutorRun + 301
> 8 postgres 0x00000001066600c5 PortalRunSelect + 389
> 9 postgres 0x000000010665fc7f PortalRun + 527
> 10 postgres 0x000000010665ed59 exec_simple_query + 1641
> 11 postgres 0x000000010665c99d PostgresMain + 3661
> 12 postgres 0x00000001065d6a8a BackendRun + 410
> 13 postgres 0x00000001065d61c4 ServerLoop + 3044
> 14 postgres 0x00000001065d2fe9 PostmasterMain + 3769
> 15 postgres 0x000000010652e3b0 help + 0
> 16 libdyld.dylib 0x00007fff6725fcc9 start + 1
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: 2020-05-11 10:11:47.394 PDT [41091] LOG: server process (PID 41309) was terminated by signal 6: Abort trap: 6
>
>
>
> From commandline, second error:
>
> pgtest % pg_amcheck -i test
> (relname=corrupted,blkno=0,offnum=16,lp_off=7680,lp_flags=1,lp_len=31,attnum=,chunk=)
> tuple xmin = 3289393 is in the future
> (relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=)
> tuple xmax = 0 precedes relation relminmxid = 1
> (relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=)
> tuple xmin = 12593 is in the future
> (relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=)
>
> <snip>
>
> (relname=corrupted,blkno=107,offnum=20,lp_off=7392,lp_flags=1,lp_len=34,attnum=,chunk=)
> tuple xmin = 306 precedes relation relfrozenxid = 487
> (relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
> tuple xmax = 0 precedes relation relminmxid = 1
> (relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
> tuple xmin = 305 precedes relation relfrozenxid = 487
> (relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
> t_hoff > lp_len (54 > 34)
> (relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
> t_hoff not max-aligned (54)
> TRAP: FailedAssertion("TransactionIdIsValid(xmax)", File: "heapam_visibility.c", Line: 1319)
> 0 postgres 0x0000000105b22977 ExceptionalCondition + 103
> 1 postgres 0x0000000105636e86 HeapTupleSatisfiesVacuum + 1158
> 2 postgres 0x0000000105634aa1 heapam_index_build_range_scan + 1089
> 3 amcheck.so 0x00000001060100f3 bt_index_check_internal + 3811
> 4 amcheck.so 0x000000010601057c verify_btreeam + 316
> 5 postgres 0x0000000105796266 ExecMakeTableFunctionResult + 422
> 6 postgres 0x00000001057a8c35 FunctionNext + 101
> 7 postgres 0x00000001057bbf3e ExecNestLoop + 478
> 8 postgres 0x000000010578dc3d standard_ExecutorRun + 301
> 9 postgres 0x00000001059100c5 PortalRunSelect + 389
> 10 postgres 0x000000010590fc7f PortalRun + 527
> 11 postgres 0x000000010590ed59 exec_simple_query + 1641
> 12 postgres 0x000000010590c99d PostgresMain + 3661
> 13 postgres 0x0000000105886a8a BackendRun + 410
> 14 postgres 0x00000001058861c4 ServerLoop + 3044
> 15 postgres 0x0000000105882fe9 PostmasterMain + 3769
> 16 postgres 0x00000001057de3b0 help + 0
> 17 libdyld.dylib 0x00007fff6725fcc9 start + 1
> pg_amcheck: error: query failed: server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.

I have just browsed through the patch and the idea is quite
interesting. I think we can expand it to check that whether the flags
set in the infomask are sane or not w.r.t other flags and xid status.
Some examples are

- If HEAP_XMAX_LOCK_ONLY is set in infomask then HEAP_KEYS_UPDATED
should not be set in new_infomask2.
- If HEAP_XMIN(XMAX)_COMMITTED is set in the infomask then can we
actually cross verify the transaction status from the CLOG and check
whether is matching the hint bit or not.

While browsing through the code I could not find that we are doing
this kind of check, ignore if we are already checking this.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-06-11 16:32:04 Re: Recording test runtimes with the buildfarm
Previous Message Justin Pryzby 2020-06-11 15:37:53 pg_dump, gzwrite, and errno