Re: amcheck (B-Tree integrity checking tool)

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Kevin Grittner <kgrittn(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Subject: Re: amcheck (B-Tree integrity checking tool)
Date: 2016-11-16 21:03:03
Message-ID: 20161116210303.a2oaelkk2vavr5vq@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I think the patch could use a pgindent run.

On 2016-09-07 11:44:01 -0700, Peter Geoghegan wrote:
> diff --git a/contrib/amcheck/amcheck--1.0.sql b/contrib/amcheck/amcheck--1.0.sql
> new file mode 100644
> index 0000000..ebbd6ac
> --- /dev/null
> +++ b/contrib/amcheck/amcheck--1.0.sql
> @@ -0,0 +1,20 @@
> +/* contrib/amcheck/amcheck--1.0.sql */
> +
> +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> +\echo Use "CREATE EXTENSION amcheck" to load this file. \quit
> +
> +--
> +-- bt_index_check()
> +--
> +CREATE FUNCTION bt_index_check(index regclass)
> +RETURNS VOID
> +AS 'MODULE_PATHNAME', 'bt_index_check'
> +LANGUAGE C STRICT;
> +
> +--
> +-- bt_index_parent_check()
> +--
> +CREATE FUNCTION bt_index_parent_check(index regclass)
> +RETURNS VOID
> +AS 'MODULE_PATHNAME', 'bt_index_parent_check'
> +LANGUAGE C STRICT;

I'd really want a function that runs all check on a table.

> +#define CHECK_SUPERUSER() { \
> + if (!superuser()) \
> + ereport(ERROR, \
> + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
> + (errmsg("must be superuser to use verification functions")))); }

Why is this a macro?

> +/*
> + * Callers to verification functions should never receive a false positive
> + * indication of corruption. Therefore, when using amcheck functions for
> + * stress testing, it may be useful to temporally change the CORRUPTION elevel
> + * to PANIC, to immediately halt the server in the event of detecting an
> + * invariant condition violation. This may preserve more information about the
> + * nature of the underlying problem. Note that modifying the CORRUPTION
> + * constant to be an elevel < ERROR is not well tested.
> + */
> +#ifdef NOT_USED
> +#define CORRUPTION PANIC
> +#define CONCERN WARNING
> +#define POSITION NOTICE
> +#else
> +#define CORRUPTION ERROR
> +#define CONCERN DEBUG1
> +#define POSITION DEBUG2
> +#endif

Hm, if we want that - and it doesn't seem like a bad idea - I think we
should be make it available without recompiling.

> +/*
> + * A B-Tree cannot possibly have this many levels, since there must be one
> + * block per level, which is bound by the range of BlockNumber:
> + */
> +#define InvalidBtreeLevel ((uint32) InvalidBlockNumber)

Hm, I think it'd be, for the long term, better if we'd move the btree
check code to amcheck_nbtree.c or such.

> +Datum
> +bt_index_parent_check(PG_FUNCTION_ARGS)
> +{
> + Oid indrelid = PG_GETARG_OID(0);
> + Oid heapid;
> + Relation indrel;
> + Relation heaprel;
> +
> + CHECK_SUPERUSER();
> +
> + /*
> + * We must lock table before index to avoid deadlocks. However, if the
> + * passed indrelid isn't an index then IndexGetRelation() will fail.
> + * Rather than emitting a not-very-helpful error message, postpone
> + * complaining, expecting that the is-it-an-index test below will fail.
> + */
> + heapid = IndexGetRelation(indrelid, true);
> + if (OidIsValid(heapid))
> + heaprel = heap_open(heapid, ShareLock);
> + else
> + heaprel = NULL;
> +
> + /*
> + * Open the target index relations separately (like relation_openrv(), but
> + * with heap relation locked first to prevent deadlocking). In hot standby
> + * mode this will raise an error.
> + */
> + indrel = index_open(indrelid, ExclusiveLock);

Theoretically we should recheck that the oids still match, theoretically
the locking here allows the index->table mapping to change. It's not a
huge window tho...

> + /* Check for active uses of the index in the current transaction */
> + CheckTableNotInUse(indrel, "bt_index_parent_check");

Why do we actually care?

> +static void
> +btree_index_checkable(Relation rel)
> +{
> + if (rel->rd_rel->relkind != RELKIND_INDEX ||
> + rel->rd_rel->relam != BTREE_AM_OID)
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("only nbtree access method indexes are supported"),
> + errdetail("Relation \"%s\" is not a B-Tree index.",
> + RelationGetRelationName(rel))));
> +
> + if (RELATION_IS_OTHER_TEMP(rel))
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("cannot access temporary tables of other sessions"),
> + errdetail("Index \"%s\" is associated with temporary relation.",
> + RelationGetRelationName(rel))));
> +
> + if (!rel->rd_index->indisready)
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("cannot check index \"%s\"",
> + RelationGetRelationName(rel)),
> + errdetail("Index is not yet ready for insertions")));

Why can't we check this? Seems valuable to allow checking after a
partial CONCURRENTLY index build?

Why don't you use IndexIsReady et al? Not sure if the patch compiles
against an earlier release (there weren't valid/ready/live dead fields,
but instead it was compressed into fewer)?

> +/*
> + * bt_check_every_level()
> + *
> + * Main entry point for B-Tree SQL-callable functions. Walks the B-Tree in
> + * logical order, verifying invariants as it goes.
> + *
> + * It is the caller's responsibility to acquire appropriate heavyweight lock on
> + * the index relation, and advise us if extra checks are safe when an
> + * ExclusiveLock is held. An ExclusiveLock is generally assumed to prevent any
> + * kind of physical modification to the index structure, including
> + * modifications that VACUUM may make.
> + */

Hm, what kind of locking does the kill_prior_tuple stuff use?

> +static void
> +bt_check_every_level(Relation rel, bool exclusivelock)
> +{

> + /*
> + * Certain deletion patterns can result in "skinny" B-Tree indexes, where
> + * the fast root and true root differ.
> + *
> + * Start from the true root, not the fast root, unlike conventional index
> + * scans. This approach is more thorough, and removes the risk of
> + * following a stale fast root from the meta page.
> + */
> + if (metad->btm_fastroot != metad->btm_root)
> + ereport(CONCERN,
> + (errcode(ERRCODE_DUPLICATE_OBJECT),
> + errmsg("fast root mismatch in index %s",
> + RelationGetRelationName(rel)),
> + errdetail_internal("Fast block %u (level %u) "
> + "differs from true root "
> + "block %u (level %u).",
> + metad->btm_fastroot, metad->btm_fastlevel,
> + metad->btm_root, metad->btm_level)));

Hm, isn't that a potentially spurious report?

> +static BtreeLevel
> +bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
> +{

> + do
> + {
> + /* Don't rely on CHECK_FOR_INTERRUPTS() calls at lower level */
> + CHECK_FOR_INTERRUPTS();
> +
> + /* Initialize state for this iteration */
> + state->targetblock = current;
> + state->target = palloc_btree_page(state, state->targetblock);
> + state->targetlsn = PageGetLSN(state->target);
> +
> + opaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
> +
> + if (P_IGNORE(opaque))
> + {
> + if (P_RIGHTMOST(opaque))
> + ereport(CORRUPTION,
> + (errcode(ERRCODE_INDEX_CORRUPTED),
> + errmsg("block %u fell off the end of index \"%s\"",
> + current, RelationGetRelationName(state->rel))));
> + else
> + ereport(CONCERN,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("block %u of index \"%s\" ignored",
> + current, RelationGetRelationName(state->rel))));

Hm, why is this worth a WARNING? Isn't it perfectly normal to have a few
of these?

> + /*
> + * Before beginning any non-trivial examination of level, establish
> + * next level down's leftmost block number, which next call here
> + * will pass as its leftmost (iff this isn't leaf level).

"next level down's leftmost block number" - it'd be good to simplify
that a bit.

> +/*
> + * bt_target_page_check()

Why "target"?

> +static void
> +bt_target_page_check(BtreeCheckState *state)
> +{

> + /*
> + * ********************
> + * * Page order check *
> + * ********************

Isn't that item order, rather than page order?

> + }
> + /*
> + * ********************
> + * * Last item check *
> + * ********************

Newline before block wouldn't hurt.

> + * Check last item against next/right page's first data item's when
> + * last item on page is reached.

I think this need some polishing.

> + /*
> + * ********************
> + * * Downlink check *
> + * ********************
> + *
> + * Additional check of child items against target page (their parent),
> + * iff this is internal page and caller holds ExclusiveLock on index

s/is/is an/ /holds/holds an/

> +static ScanKey
> +bt_right_page_check_scankey(BtreeCheckState *state)
> +{

> + * A deleted page won't actually be recycled by VACUUM early enough for us
> + * to fail to be able follow its right link (or left link, or downlink),

"for us to fail to be able" sounds odd.

> + /*
> + * No ExclusiveLock held case -- why it's safe to proceed.
> + *
> + * Problem:
> + *
> + * We must avoid false positive reports of corruption when caller treats
> + * item returned here as an upper bound on target's last item. In general,
> + * false positives are disallowed. Ensuring they don't happen in
> + * !exclusivelock case is subtle.

s/item/an item/?

This paragraph is too complicated to follow in the noisy environment
where we both currently are ;)

> +/*
> + * bt_downlink_check()
> + *
> + * Checks one of target's downlink against its child page.

downlinks?

> + * Conceptually, the target page continues to be what is checked here. The
> + * target block is still blamed in the event of finding an invariant violation.
> + * The downlink insertion into the target is probably where any problem raised
> + * here arises, and there is no such thing as a parent link, so doing the
> + * verification this way around is much more practical.
> + */
> +static void
> +bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
> + ScanKey targetkey)
> +{

> + * between the deleted page and its right sibling. (We cannot delete a
> + * parent page's rightmost page unless it is the last child page, and we
> + * intend to delete the parent itself.)

"parent page's rightmost page"?

I like this. Some of the more complex pieces towards the end of the
field need some attention, there's a fair amount of word-smithing
needed, and I do think we want to make the structural changes outlined
above, but besides these, imo fairly simple adaptions, I do think this
is useful and not that far from being committable.

- Andres

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2016-11-16 21:10:24 Re: Re: [COMMITTERS] pgsql: Build HTML documentation using XSLT stylesheets by default
Previous Message Peter Eisentraut 2016-11-16 21:00:37 Re: Re: [COMMITTERS] pgsql: Build HTML documentation using XSLT stylesheets by default