Re: amcheck (B-Tree integrity checking tool)

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <kgrittn(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Subject: Re: amcheck (B-Tree integrity checking tool)
Date: 2017-02-09 08:05:49
Message-ID: 20170209080549.op67jds25w3jvp47@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2016-11-22 10:56:07 -0800, Peter Geoghegan wrote:
> Andres said: "Theoretically we should recheck that the oids still
> match, theoretically the locking here allows the index->table mapping
> to change". I don't know how to act on this feedback, since comparable
> index + heap locking code should have the same problem, but doesn't
> consider it. How is this any different to reindex_index()?

reindex_index isn't necessarily comparable, because
RangeVarCallbackForReindexIndex() is, on first blush at least, careful
enough about the locking, and reindex_relation only gets the index list
after building the index:
/*
* Find and lock index, and check permissions on table; use callback to
* obtain lock on table first, to avoid deadlock hazard. The lock level
* used here must match the index lock obtained in reindex_index().
*/

I think you ought to do something similar.

I'd really like to have something like relation_check(), index_check()
which calls the correct functions for the relevan index types (and
possibly warns about !nbtree indexes not being checked?).

> @@ -0,0 +1,1227 @@
> +/*-------------------------------------------------------------------------
> + *
> + * verify_nbtree.c
> + * Verifies the integrity of nbtree indexes based on invariants.
> + *
> + * For B-Tree indexes, verification includes the invariant that each page in
> + * the target index has items in logical order as reported by an insertion
> + * scankey (the insertion scankey sort-wise NULL semantics are needed for
> + * verification).

I'd appreciate some rephrasing of this paragraph. "includes the
invariant that" isn't that easy to understand for an ESL person.

> + *
> + * Copyright (c) 2016, PostgreSQL Global Development Group

... ;)

> +#define CHECK_SUPERUSER() { \
> + if (!superuser()) \
> + ereport(ERROR, \
> + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
> + (errmsg("must be superuser to use verification functions")))); }

Please make this either a function or just copy - the macro doesn't buy
us anything except weirder looking syntax.

> +/*
> + * Callers to verification functions should never receive a false positive
> + * indication of corruption. Therefore, when using amcheck functions for
> + * stress testing, it may be useful to temporally change the CORRUPTION elevel
> + * to PANIC, to immediately halt the server in the event of detecting an
> + * invariant condition violation. This may preserve more information about the
> + * nature of the underlying problem. Note that modifying the CORRUPTION
> + * constant to be an elevel < ERROR is not well tested.
> + */
> +#ifdef NOT_USED
> +#define CORRUPTION PANIC
> +#define CONCERN WARNING
> +#define POSITION NOTICE
> +#else
> +#define CORRUPTION ERROR
> +#define CONCERN DEBUG1
> +#define POSITION DEBUG2
> +#endif

Please remove this block and use the normal elevels.

> +/*
> + * State associated with verifying a B-Tree index
> + */
> +typedef struct BtreeCheckState
> +{
> + /*
> + * Unchanging state, established at start of verification:
> + *
> + * rel: B-Tree Index Relation
> + * exclusivelock: ExclusiveLock held on rel; else AccessShareLock
> + * checkstrategy: Buffer access strategy
> + * targetcontext: Target page memory context
> + */
> + Relation rel;
> + bool exclusivelock;
> + BufferAccessStrategy checkstrategy;
> + MemoryContext targetcontext;

Can you move the field comments to the individual fields? Separating
them makes it more likely to get out of date (obviously keep the
header). I'd also rename 'exclusivelock' to 'exclusivelylocked' or
such.

> + /*
> + * Mutable state, for verification of particular page:
> + *
> + * target: Main target page
> + * targetblock: Main target page's block number
> + * targetlsn: Main target page's LSN

Same here.

> +PG_FUNCTION_INFO_V1(bt_index_check);
> +PG_FUNCTION_INFO_V1(bt_index_parent_check);

I think it'd be good to avoid adding a lot of functions for each
additional type of check we can think of. I think adding a named 'bool
check_parents = false' argument or such would be better. Then we can
simply add more and more checks subsequently, and we can have
'bool all_checks = false' so people can trivially check everything,
instead of having to add new checks in each new function.

> +/*
> + * bt_index_check(index regclass)
> + *
> + * Verify integrity of B-Tree index.
> + *
> + * Only acquires AccessShareLock on index relation. Does not consider
> + * invariants that exist between parent/child pages.
> + */
> +Datum
> +bt_index_check(PG_FUNCTION_ARGS)
> +{
> + Oid relid = PG_GETARG_OID(0);
> + Relation indrel;
> +
> + CHECK_SUPERUSER();
> +
> + indrel = relation_open(relid, AccessShareLock);
>

> + /* Relation suitable for checking as B-Tree? */
> + btree_index_checkable(indrel);
> +
> + /* Check index */
> + bt_check_every_level(indrel, false);
> +
> + relation_close(indrel, AccessShareLock);

I'm *VERY* strongly against releasing locks on user relations early.

> +/*
> + * bt_index_parent_check(index regclass)
> + *
> + * Verify integrity of B-Tree index.
> + *
> + * Acquires ExclusiveLock on index relation, and ShareLock on the associated
> + * heap relation, a bit like REINDEX. Verifies that downlinks in parent pages
> + * are valid lower bounds on child pages.
> + */
> +Datum
> +bt_index_parent_check(PG_FUNCTION_ARGS)
> +{
> + Oid indrelid = PG_GETARG_OID(0);
> + Oid heapid;
> + Relation indrel;
> + Relation heaprel;
> +
> + CHECK_SUPERUSER();
> +
> + /*
> + * We must lock table before index to avoid deadlocks. However, if the
> + * passed indrelid isn't an index then IndexGetRelation() will fail.
> + * Rather than emitting a not-very-helpful error message, postpone
> + * complaining, expecting that the is-it-an-index test below will fail.
> + */
> + heapid = IndexGetRelation(indrelid, true);
> + if (OidIsValid(heapid))
> + heaprel = heap_open(heapid, ShareLock);
> + else
> + heaprel = NULL;
> +
> + /*
> + * Open the target index relations separately (like relation_openrv(), but
> + * with heap relation locked first to prevent deadlocking). In hot standby
> + * mode this will raise an error.
> + */
> + indrel = index_open(indrelid, ExclusiveLock);

So yea, this is what I was complaining about re locking / index & table
relation IIRC. Note you're also deadlock prone by violating the rule
about always locking the relation first.

> + index_close(indrel, ExclusiveLock);
> + if (heaprel)
> + heap_close(heaprel, ShareLock);

Same thing about releasing locks early.

> +/*
> + * btree_index_checkable()
> + *
> + * Basic checks about the suitability of a relation for checking as a B-Tree
> + * index.
> + */
> +static void
> +btree_index_checkable(Relation rel)
> +{
> + if (rel->rd_rel->relkind != RELKIND_INDEX ||
> + rel->rd_rel->relam != BTREE_AM_OID)
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("only nbtree access method indexes are supported"),
> + errdetail("Relation \"%s\" is not a B-Tree index.",
> + RelationGetRelationName(rel))));

I think this message could stand some rephrasing. There's a significant
overlap between msg and detail, and they use different phrasing for
btree.

> + if (!IndexIsValid(rel->rd_index))
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("cannot check index \"%s\"",
> + RelationGetRelationName(rel)),
> + errdetail("Index is not valid")));

I'm perfectly fine with not changing that, but IIUC it'd be fine to
weaken this to IndexIsLive() - right?

> + * It is the caller's responsibility to acquire appropriate heavyweight lock on
> + * the index relation, and advise us if extra checks are safe when an
> + * ExclusiveLock is held. An ExclusiveLock is generally assumed to prevent any
> + * kind of physical modification to the index structure, including
> + * modifications that VACUUM may make.

It might make sense to be a bit more explicit about the structure bit
(which I assume refers to the fact that modifications to individual
pages are possible (re killitems)?).

> + */
> +static void
> +bt_check_every_level(Relation rel, bool exclusivelock)

Hm, is every level accurate - this is more like checking every
"reachable" page, right?

> + /*
> + * Certain deletion patterns can result in "skinny" B-Tree indexes, where
> + * the fast root and true root differ.
> + *
> + * Start from the true root, not the fast root, unlike conventional index
> + * scans. This approach is more thorough, and removes the risk of
> + * following a stale fast root from the meta page.
> + */
> + if (metad->btm_fastroot != metad->btm_root)
> + ereport(CONCERN,
> + (errcode(ERRCODE_DUPLICATE_OBJECT),
> + errmsg("fast root mismatch in index %s",
> + RelationGetRelationName(rel)),
> + errdetail_internal("Fast block %u (level %u) "
> + "differs from true root "
> + "block %u (level %u).",
> + metad->btm_fastroot, metad->btm_fastlevel,
> + metad->btm_root, metad->btm_level)));

I'd weaken the message to something that sounds less like an actual
problem... - which this is not, right?

> +/*
> + * bt_check_level_from_leftmost()
> + *
> + * Given a left-most block at some level, move right, verifying each page
> + * individually (with more verification across pages for "exclusivelock"
> + * callers). Caller should pass the true root page as the leftmost initially,
> + * working their way down by passing what is returned for the last call here
> + * until level 0 (leaf page level) was reached.
> + *
> + * Returns state for next call, if any.

Wouldn't "'sets state for the next call..." be more accurate than
returning?

> +static BtreeLevel
> +bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
> +{

> + if (P_IGNORE(opaque))
> + {
> + if (P_RIGHTMOST(opaque))
> + ereport(CORRUPTION,
> + (errcode(ERRCODE_INDEX_CORRUPTED),
> + errmsg("block %u fell off the end of index \"%s\"",
> + current, RelationGetRelationName(state->rel))));

I think this could stand a bit more formal language.

> + else
> + ereport(CONCERN,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("block %u of index \"%s\" ignored",
> + current, RelationGetRelationName(state->rel))));

missing verb. Also, is this actually worth logging (even at such a low level)?

> + /*
> + * Before beginning any non-trivial examination of level, establish
> + * state to be passed by caller for next call here, if this isn't
> + * last call/leaf level.

"state to be passed by caller for next call here" is hard to
understand. "Prepare state for next bt_check_level_from_leftmost
invocation for the next level"?

> + * There should be at least one non-ignorable page per level.
> + */

I might have missed something, but you're not checking that, are you?

> + if (state->exclusivelock && opaque->btpo_prev != leftcurrent)
> + ereport(CORRUPTION,
> + (errcode(ERRCODE_INDEX_CORRUPTED),
> + errmsg("right link/left link pair in index \"%s\" not in mutual agreement",
> + RelationGetRelationName(state->rel)),
> + errdetail_internal("Deleted block=%u left block=%u link reported block=%u.",
> + current, leftcurrent, opaque->btpo_prev)));

"mutual" doesn't seem to add much here.

> + /* Try to detect circular links */
> + if (current == leftcurrent || current == opaque->btpo_prev)
> + ereport(CORRUPTION,
> + (errcode(ERRCODE_INDEX_CORRUPTED),
> + errmsg("circular link chain found in block %u of index \"%s\"",
> + current, RelationGetRelationName(state->rel))));

That's not really reliable though, is it? I've seen corrupted link
chains (leading to endless vacuums and such) a number of times, so maybe
explicitly keeping track of already visited pages within a level would
be worthwhile?

> +/*
> + * bt_target_page_check()
> + *
> + * Function performs the following checks on target page, or pages ancillary to
> + * target page:
> + *
> + * - That every "real" data item is less than or equal to the high key, which
> + * is an upper bound on the items on the pages (where there is a high key at
> + * all -- pages that are rightmost lack one).
> + *
> + * - That within the page, every "real" item is less than or equal to the item
> + * immediately to its right, if any (i.e., that the items are in order within
> + * the page, so that the binary searches performed by index scans are sane).
> + *
> + * - That the last item stored on the page is less than or equal to the first
> + * "real" data item on the page to the right (if such a first item is
> + * available).

Hm. I think it'd be useful if we also made sure (or in the caller) that
the page actually is at the correct level. I seem to recall seing a
case where downlinks were "circular" due to corruption, leading to stuck
scans.

> + * Note: This routine is not especially proactive in freeing memory. High
> + * watermark memory consumption is bound to some small fixed multiple of
> + * BLCKSZ, though. Caller should reset the current context between calls here.

I'd shorten this to "Memory allocated in this routine is expected to be
released by the caller resetting the ->xxx/current context".

> +static void
> +bt_target_page_check(BtreeCheckState *state)
> +{
> + OffsetNumber offset;
> + OffsetNumber max;
> + BTPageOpaque topaque;
> +
> + topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
> + max = PageGetMaxOffsetNumber(state->target);
> +
> + elog(POSITION, "verifying %u items on %s block %u", max,
> + P_ISLEAF(topaque) ? "leaf":"internal", state->targetblock);

I think it's fine if you don't, but this could theoretically integrate
with the progress stuff added in 9.6 (as in
pgstat_progress_start_command()). That'd allow for a lot more
convenient display of this than thousands of debug messages or nothing.

> + /*
> + * Loop over page items, but don't start from P_HIKEY (don't have iteration
> + * directly considering high key item, if any).

I don't understand the parenthetical comment as phrased.

> + /*
> + * ********************
> + * * High key check *
> + * ********************
> + *
> + * If there is a high key, which there must be for a non-rightmost
> + * page, check that it actually is upper bound on all page items.

"which there must be for" sounds weird.

> + * We prefer to check all items, rather than checking just the first
> + * and trusting that the operator class obeys the transitive law (which
> + * implies that all subsequent items also respected the high key
> + * invariant if they pass the item order check).

s/prefer to//

> + /*
> + * ********************
> + * * Item order check *
> + * ********************
> + *
> + * Check that items are stored on page in logical order, by checking
> + * current item is less than or equal to next item (if any).
> + */

Does that kind of comment look reasonable after pgindent? Actually,
could you just pindent this patch (including typedefs.list additions)
and repair the bad looking damage?

> + if (OffsetNumberNext(offset) <= max &&
> + !invariant_key_less_than_equal_offset(state, skey,
> + OffsetNumberNext(offset)))
> + {
> + char *itid, *htid,
> + *nitid, *nhtid;
> +
> + itid = psprintf("(%u,%u)", state->targetblock, offset);
> + htid = psprintf("(%u,%u)",
> + ItemPointerGetBlockNumber(&(itup->t_tid)),
> + ItemPointerGetOffsetNumber(&(itup->t_tid)));
> + nitid = psprintf("(%u,%u)", state->targetblock,
> + OffsetNumberNext(offset));

Do all these separate psprintfs have much of a point? It seems just as
easy to includ ethem in the actual ereport?

> + * next/sibling page. There may not be such an item available from
> + * sibling for various reasons, though (e.g., target is the rightmost
> + * page on level).
> + */

Is the comma after e.g. correct? </mega-pedanticism-from-esl-person>

> + ereport(CORRUPTION,
> + (errcode(ERRCODE_INDEX_CORRUPTED),
> + errmsg("cross page item order invariant violated for index \"%s\"",
> + RelationGetRelationName(state->rel)),
> + errdetail_internal("Last item on page tid=(%u,%u) "
> + "page lsn=%X/%X.",
> + state->targetblock, offset,
> + (uint32) (state->targetlsn >> 32),
> + (uint32) state->targetlsn)));

I'm am *wondering* (as in not sure), if we could get rid of all the "for
index ..." bits and use an errcontext for that instead, which'd be
included in all messages.

> + /*
> + * General notes on concurrent page splits and page deletion:
> + *
> + * Routines like _bt_search() don't require *any* page split interlock when
> + * descending the tree, including something very light like a buffer pin.
> + * That's why it's okay that we don't either. This avoidance of any need
> + * to "couple" buffer locks is the raison d' etre of the Lehman & Yao
> + * algorithm, in fact.

raison d'etre comment seems more like something belonging to
nbtree/README than this, but I don't care much.

> + /*
> + * No ExclusiveLock held case -- why it's safe to proceed.
> + *
> + * Problem:
> + *
> + * We must avoid false positive reports of corruption when caller treats
> + * item returned here as an upper bound on target's last item. In general,
> + * false positives are disallowed. Ensuring they don't happen in
> + * !exclusivelock case is subtle.

Maybe s/Ensuring/The reasons why they don't happen in ...'?

> + * A concurrent page deletion by VACUUM of the target page can result in
> + * the insertion of items on to this right sibling page that would

"on to this right sibling page"?

> + * previously have been inserted on our target page. There might have been
> + * insertions that followed target's downlink after it was made to point to

missing trailing 'the'? I'm really not a native speaker (and thus might
be off the mark), but making these sentences a bit clearer really
wouldn't hurt.

> +static void
> +bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
> + ScanKey targetkey)
> +{
> + OffsetNumber offset;
> + OffsetNumber maxoffset;
> + Page child;
> + BTPageOpaque copaque;
> +
> + /*
> + * Caller must have ExclusiveLock on target relation, because of
> + * considerations around page deletion by VACUUM.
> + *
> + * N.B.: In general, page deletion deletes the right sibling's downlink,

s/N.B./NB/ by precedence ;)

> + * We prefer to check all items, rather than checking just the first and
> + * trusting that the operator class obeys the transitive law.
> + */

s/We prefer//

> + child = palloc_btree_page(state, childblock);
> + copaque = (BTPageOpaque) PageGetSpecialPointer(child);
> + maxoffset = PageGetMaxOffsetNumber(child);
> +
> + for (offset = P_FIRSTDATAKEY(copaque);
> + offset <= maxoffset;
> + offset = OffsetNumberNext(offset))
> + {
> + /*
> + * Skip comparison of target page key against "minus infinity" item, if
> + * any. Checking it would indicate that it's not an upper bound, but
> + * that's only because of the hard-coding within _bt_compare().
> + */
> + if (OFFSET_IS_MINUS_INFINITY(copaque, offset))
> + continue;

FWIW, I'd even convert this to an inline function rather than a macro.
Macros are useful when you have to do stuff you can't do otherwise, but
shouldn't be used much otherwise.

> + if (!invariant_key_less_than_equal_nontarget_offset(state, child,
> + targetkey, offset))
> + ereport(CORRUPTION,
> + (errcode(ERRCODE_INDEX_CORRUPTED),

What's your reasoning for differing sometimes using
ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE and sometimes ERRCODE_INDEX_CORRUPTED?

> + errmsg("down-link lower bound invaariant violated for index \"%s\"",

s/invaariant/invariant/

> +/*
> + * invariant_key_greater_than_equal_offset()

I'd personally just remove the function name in these, but that's personal
taste and i'll leave you the decision.

- Andres

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Antonin Houska 2017-02-09 09:45:56 Re: pg_bsd_indent: implement -lps ("leave preprocessor space")
Previous Message Michael Paquier 2017-02-09 07:33:05 Re: SCRAM authentication, take three