Re: amcheck (B-Tree integrity checking tool)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <kgrittn(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Subject: Re: amcheck (B-Tree integrity checking tool)
Date: 2017-02-09 21:29:42
Message-ID: CAH2-Wzm+E4NaxQ6hXQ-VfZiy9z7Z4h1E75vMXcQSBAW96aue4A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 9, 2017 at 12:05 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> reindex_index isn't necessarily comparable, because
> RangeVarCallbackForReindexIndex() is, on first blush at least, careful
> enough about the locking, and reindex_relation only gets the index list
> after building the index:
> /*
> * Find and lock index, and check permissions on table; use callback to
> * obtain lock on table first, to avoid deadlock hazard. The lock level
> * used here must match the index lock obtained in reindex_index().
> */
>
> I think you ought to do something similar.

Okay. Looking into it, but see my response to your later remarks below.

> I'd really like to have something like relation_check(), index_check()
> which calls the correct functions for the relevan index types (and
> possibly warns about !nbtree indexes not being checked?).

I feel that that's something for an in-core facility that exposes this
through DDL. One good reason to not do that just yet is that I'm not
completely comfortable with what a uniform interface looks like here.
And frankly, I really just want to get this out of the way, to prove
the general concept of verification tooling. amcheck v1 is a good
start, but much work remains. I think that no one will take the idea
completely seriously until it is usable as a core extension. Let's get
the ball rolling.

>> +#define CHECK_SUPERUSER() { \
>> + if (!superuser()) \
>> + ereport(ERROR, \
>> + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
>> + (errmsg("must be superuser to use verification functions")))); }
>
> Please make this either a function or just copy - the macro doesn't buy
> us anything except weirder looking syntax.

It's actually from pageinspect, but okay.

>> +#define CORRUPTION ERROR
>> +#define CONCERN DEBUG1
>> +#define POSITION DEBUG2
>> +#endif
>
> Please remove this block and use the normal elevels.

Okay.

>> +/*
>> + * State associated with verifying a B-Tree index
>> + */
>> +typedef struct BtreeCheckState
>> +{
>> + /*
>> + * Unchanging state, established at start of verification:
>> + *
>> + * rel: B-Tree Index Relation
>> + * exclusivelock: ExclusiveLock held on rel; else AccessShareLock
>> + * checkstrategy: Buffer access strategy
>> + * targetcontext: Target page memory context
>> + */
>> + Relation rel;
>> + bool exclusivelock;
>> + BufferAccessStrategy checkstrategy;
>> + MemoryContext targetcontext;
>
> Can you move the field comments to the individual fields? Separating
> them makes it more likely to get out of date (obviously keep the
> header). I'd also rename 'exclusivelock' to 'exclusivelylocked' or
> such.

Okay.

>> + /*
>> + * Mutable state, for verification of particular page:
>> + *
>> + * target: Main target page
>> + * targetblock: Main target page's block number
>> + * targetlsn: Main target page's LSN
>
> Same here.

Okay.

>> +PG_FUNCTION_INFO_V1(bt_index_check);
>> +PG_FUNCTION_INFO_V1(bt_index_parent_check);
>
> I think it'd be good to avoid adding a lot of functions for each
> additional type of check we can think of. I think adding a named 'bool
> check_parents = false' argument or such would be better. Then we can
> simply add more and more checks subsequently, and we can have
> 'bool all_checks = false' so people can trivially check everything,
> instead of having to add new checks in each new function.

I strongly disagree. I would agree if the lock strength were the same
in both cases, but it isn't. Varying the strength of heavyweight lock
taken based on an argument to a function is a massive foot-gun IMV. I
do think that that's the logical way to split out functionality. I
don't think that there is likely to be a problem with adding stuff in
the future. It'll either be something that we can do in passing
anyway, in which case we can just add it to the existing checks
harmlessly. Or, it will be something like a tool that verifies the
heap is consistent with a given index, which is vastly more expensive
in terms of CPU and I/O costs, and yet is probably no worse than
bt_index_check() in terms of lock strength (AccessShareLock).

My thinking here is that users really ought to use bt_index_check() in
almost all cases, not bt_index_parent_check(), which is more of a tool
for hackers that are developing new B-Tree features. The distinction
is blurry, in fact, but the lock strength requirements of
bt_index_parent_check() make it vastly less useful. Especially
considering that it seems fairly unlikely that it will ever catch a
problem that bt_index_check() would not have. bt_index_parent_check()
will become a lot more important when we implement suffix truncation
in internal pages. (I have a very rough patch that does this, actually
-- it passes the regression tests, and verification by amcheck, but
isn't anywhere near ready to post.)

>> +/*
>> + * bt_index_check(index regclass)
>> + *
>> + * Verify integrity of B-Tree index.
>> + *
>> + * Only acquires AccessShareLock on index relation. Does not consider
>> + * invariants that exist between parent/child pages.
>> + */

> I'm *VERY* strongly against releasing locks on user relations early.

Why? pageinspect does this. I think it could be really handy for users
that want to write SQL queries when using bt_index_parent_check().
It's a bit weird, in the sense that it's a special exception that has
to be explained sometimes, but I think it is worth it for something
like this.

>> + /*
>> + * We must lock table before index to avoid deadlocks. However, if the
>> + * passed indrelid isn't an index then IndexGetRelation() will fail.
>> + * Rather than emitting a not-very-helpful error message, postpone
>> + * complaining, expecting that the is-it-an-index test below will fail.
>> + */
>> + heapid = IndexGetRelation(indrelid, true);
>> + if (OidIsValid(heapid))
>> + heaprel = heap_open(heapid, ShareLock);
>> + else
>> + heaprel = NULL;
>> +
>> + /*
>> + * Open the target index relations separately (like relation_openrv(), but
>> + * with heap relation locked first to prevent deadlocking). In hot standby
>> + * mode this will raise an error.
>> + */
>> + indrel = index_open(indrelid, ExclusiveLock);
>
> So yea, this is what I was complaining about re locking / index & table
> relation IIRC. Note you're also deadlock prone by violating the rule
> about always locking the relation first.

This is based on brin_summarize_new_values(). I'm not doing a recheck,
though, because I don't actually need the heap relation for anything
(I suppose I imagined that that was okay, possibly failing to consider
some subtlety beyond "consistency is good"). If I added a recheck for
the heap relation in the style of brin_summarize_new_values(), would
you consider this item closed, and all heavyweight lock concerns
addressed?

The comments here are almost a direct lift -- is
brin_summarize_new_values() wrong to itself claim to be
deadlock-safe?.

(BTW, brin_summarize_new_values(), an SQL-callable function, also
releases heavyweight locks early.)

>> +/*
>> + * btree_index_checkable()
>> + *
>> + * Basic checks about the suitability of a relation for checking as a B-Tree
>> + * index.
>> + */
>> +static void
>> +btree_index_checkable(Relation rel)
>> +{
>> + if (rel->rd_rel->relkind != RELKIND_INDEX ||
>> + rel->rd_rel->relam != BTREE_AM_OID)
>> + ereport(ERROR,
>> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>> + errmsg("only nbtree access method indexes are supported"),
>> + errdetail("Relation \"%s\" is not a B-Tree index.",
>> + RelationGetRelationName(rel))));
>
> I think this message could stand some rephrasing. There's a significant
> overlap between msg and detail, and they use different phrasing for
> btree.

Okay.

>> + if (!IndexIsValid(rel->rd_index))
>> + ereport(ERROR,
>> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>> + errmsg("cannot check index \"%s\"",
>> + RelationGetRelationName(rel)),
>> + errdetail("Index is not valid")));
>
> I'm perfectly fine with not changing that, but IIUC it'd be fine to
> weaken this to IndexIsLive() - right?

I believe so, but prefer to be more restrictive in the absence of a
good reason not to be.

>> + * It is the caller's responsibility to acquire appropriate heavyweight lock on
>> + * the index relation, and advise us if extra checks are safe when an
>> + * ExclusiveLock is held. An ExclusiveLock is generally assumed to prevent any
>> + * kind of physical modification to the index structure, including
>> + * modifications that VACUUM may make.
>
> It might make sense to be a bit more explicit about the structure bit
> (which I assume refers to the fact that modifications to individual
> pages are possible (re killitems)?).

It's not really about killitems, which I explained won't matter here,
but I will add a bit about that.

>> + */
>> +static void
>> +bt_check_every_level(Relation rel, bool exclusivelock)
>
> Hm, is every level accurate - this is more like checking every
> "reachable" page, right?

Every level should have at least one page that is reachable (not
half-dead or fully dead), unless perhaps VACUUM marks the last leaf
page in the entire index as half-dead and we land on that. I think
that that's such a rare case that it shouldn't alter how we name this
function at all, which really does do what it says almost always. Note
that there are pretty severe restrictions on when vacuum can do
page-level deletion of an internal page. This is what leads to index
bloat with sparse deletion patterns (one problem I want to address by
adding suffix truncation to internal pages at a later date).

>> + /*
>> + * Certain deletion patterns can result in "skinny" B-Tree indexes, where
>> + * the fast root and true root differ.
>> + *
>> + * Start from the true root, not the fast root, unlike conventional index
>> + * scans. This approach is more thorough, and removes the risk of
>> + * following a stale fast root from the meta page.
>> + */
>> + if (metad->btm_fastroot != metad->btm_root)
>> + ereport(CONCERN,
>> + (errcode(ERRCODE_DUPLICATE_OBJECT),
>> + errmsg("fast root mismatch in index %s",
>> + RelationGetRelationName(rel)),
>> + errdetail_internal("Fast block %u (level %u) "
>> + "differs from true root "
>> + "block %u (level %u).",
>> + metad->btm_fastroot, metad->btm_fastlevel,
>> + metad->btm_root, metad->btm_level)));
>
> I'd weaken the message to something that sounds less like an actual
> problem... - which this is not, right?

No, it isn't a real problem. I'll do something about it.

>> +/*
>> + * bt_check_level_from_leftmost()
>> + *
>> + * Given a left-most block at some level, move right, verifying each page
>> + * individually (with more verification across pages for "exclusivelock"
>> + * callers). Caller should pass the true root page as the leftmost initially,
>> + * working their way down by passing what is returned for the last call here
>> + * until level 0 (leaf page level) was reached.
>> + *
>> + * Returns state for next call, if any.
>
> Wouldn't "'sets state for the next call..." be more accurate than
> returning?

Okay.

>> +static BtreeLevel
>> +bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
>> +{
>
>> + if (P_IGNORE(opaque))
>> + {
>> + if (P_RIGHTMOST(opaque))
>> + ereport(CORRUPTION,
>> + (errcode(ERRCODE_INDEX_CORRUPTED),
>> + errmsg("block %u fell off the end of index \"%s\"",
>> + current, RelationGetRelationName(state->rel))));
>
> I think this could stand a bit more formal language.

I disagree, because it's a direct lift from nbtree itself.

>> + else
>> + ereport(CONCERN,
>> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> + errmsg("block %u of index \"%s\" ignored",
>> + current, RelationGetRelationName(state->rel))));
>
> missing verb. Also, is this actually worth logging (even at such a low level)?

I'm not following -- "ignore" is a verb.

It may not be worth logging. I just don't know, and prefer to err on
the side of including it.

>> + /*
>> + * Before beginning any non-trivial examination of level, establish
>> + * state to be passed by caller for next call here, if this isn't
>> + * last call/leaf level.
>
> "state to be passed by caller for next call here" is hard to
> understand. "Prepare state for next bt_check_level_from_leftmost
> invocation for the next level"?

Okay.

>> + * There should be at least one non-ignorable page per level.
>> + */
>
> I might have missed something, but you're not checking that, are you?

No. As I said just now, I think it might not be true iff every single
item (every item at the leaf level) has been deleted and is being
vacuumed. If there was even one live item in one leaf page, it would
be guaranteed to have a live parent, which would itself be guaranteed
to if it wasn't the true root, and so on, until the true root. It's
really hard to "lose" a level by having the true root and fast root
differ, in general (of course, we can never truly lose a level -- the
pages can never be reclaimed entirely from a deleted level's last
page). I guess that this fast root stuff was added only for the
benefit of cases where there is continuous bulk deletion and
insertion.

> "mutual" doesn't seem to add much here.

Okay.

>> + /* Try to detect circular links */
>> + if (current == leftcurrent || current == opaque->btpo_prev)
>> + ereport(CORRUPTION,
>> + (errcode(ERRCODE_INDEX_CORRUPTED),
>> + errmsg("circular link chain found in block %u of index \"%s\"",
>> + current, RelationGetRelationName(state->rel))));
>
> That's not really reliable though, is it? I've seen corrupted link
> chains (leading to endless vacuums and such) a number of times, so maybe
> explicitly keeping track of already visited pages within a level would
> be worthwhile?

It's not reliable -- it's one of the "shot in the dark, but why not?"
checks, which are the majority -- no reason to not be defensive.
However, it's highly likely that cases like the one you describe would
be caught by the cross-page check, or even high-key check, because
they're probably explained by an underlying problem with transposed
pages, maybe not even 8KiB aligned pages (e.g., buggy firmware
controllers have been known to do this with their 4KiB pages, but you
wouldn't necessarily notice that easily with an int4 index or
similar).

I feel relatively confident that we don't need to do better with that
one, since it will never make a difference; the only case where that
might be wrong that I can think of is where you have an awful lot of
duplicates, which is unfortunately not that uncommon with Postgres.
Even if it could help, that sounds like something that won't pay for
itself.

> Hm. I think it'd be useful if we also made sure (or in the caller) that
> the page actually is at the correct level. I seem to recall seing a
> case where downlinks were "circular" due to corruption, leading to stuck
> scans.

That's a good idea, because it is, if nothing else, an easy,
low-overhead additional check.

>> + * Note: This routine is not especially proactive in freeing memory. High
>> + * watermark memory consumption is bound to some small fixed multiple of
>> + * BLCKSZ, though. Caller should reset the current context between calls here.
>
> I'd shorten this to "Memory allocated in this routine is expected to be
> released by the caller resetting the ->xxx/current context".

Okay.

>> +static void
>> +bt_target_page_check(BtreeCheckState *state)
>> +{
>> + OffsetNumber offset;
>> + OffsetNumber max;
>> + BTPageOpaque topaque;
>> +
>> + topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
>> + max = PageGetMaxOffsetNumber(state->target);
>> +
>> + elog(POSITION, "verifying %u items on %s block %u", max,
>> + P_ISLEAF(topaque) ? "leaf":"internal", state->targetblock);
>
> I think it's fine if you don't, but this could theoretically integrate
> with the progress stuff added in 9.6 (as in
> pgstat_progress_start_command()). That'd allow for a lot more
> convenient display of this than thousands of debug messages or nothing.

Never enough time. :-)

>> + /*
>> + * Loop over page items, but don't start from P_HIKEY (don't have iteration
>> + * directly considering high key item, if any).
>
> I don't understand the parenthetical comment as phrased.

It means that the high key item, which is physically the first item on
the leaf level, and the second item on internal levels. (That have an
explicit "minus infinity" downlink item), is skipped in the sense that
it doesn't get its own iteration of the loop (note that there is only
a high key on non-rightmost-at-level pages -- I like to say that the
rightmost pages have an implicit/imaginary "positive infinity" high
key, to illustrate the symmetry between high keys and downlinks.
Although, we don't say that in any code.)

> "which there must be for" sounds weird.

> s/prefer to//

Okay.

> Does that kind of comment look reasonable after pgindent? Actually,
> could you just pindent this patch (including typedefs.list additions)
> and repair the bad looking damage?

I'll check.

> Do all these separate psprintfs have much of a point? It seems just as
> easy to includ ethem in the actual ereport?

I thought it looked very convoluted that way.

>> + * next/sibling page. There may not be such an item available from
>> + * sibling for various reasons, though (e.g., target is the rightmost
>> + * page on level).
>> + */
>
> Is the comma after e.g. correct? </mega-pedanticism-from-esl-person>

This is a matter of style. "e.g.," is generally considered American style.

>> + ereport(CORRUPTION,
>> + (errcode(ERRCODE_INDEX_CORRUPTED),
>> + errmsg("cross page item order invariant violated for index \"%s\"",
>> + RelationGetRelationName(state->rel)),
>> + errdetail_internal("Last item on page tid=(%u,%u) "
>> + "page lsn=%X/%X.",
>> + state->targetblock, offset,
>> + (uint32) (state->targetlsn >> 32),
>> + (uint32) state->targetlsn)));
>
> I'm am *wondering* (as in not sure), if we could get rid of all the "for
> index ..." bits and use an errcontext for that instead, which'd be
> included in all messages.

Doesn't seem worth it to me.

>> + /*
>> + * General notes on concurrent page splits and page deletion:
>> + *
>> + * Routines like _bt_search() don't require *any* page split interlock when
>> + * descending the tree, including something very light like a buffer pin.
>> + * That's why it's okay that we don't either. This avoidance of any need
>> + * to "couple" buffer locks is the raison d' etre of the Lehman & Yao
>> + * algorithm, in fact.
>
> raison d'etre comment seems more like something belonging to
> nbtree/README than this, but I don't care much.

I think it adds value. This is a learning tool for nbtree. That's a
secondary goal, at least. We're *way* too shy at pointing out simple
facts like this, that are of very significant consequence. It's only
one sentence.

> Maybe s/Ensuring/The reasons why they don't happen in ...'?

> "on to this right sibling page"?

> missing trailing 'the'? I'm really not a native speaker (and thus might
> be off the mark), but making these sentences a bit clearer really
> wouldn't hurt.

Okay. The sentences are very complicated, but that's justified by the
complexity of the underlying issue. I rewrote that text several times.

> s/N.B./NB/ by precedence ;)

> s/We prefer//

Okay.

>> + if (OFFSET_IS_MINUS_INFINITY(copaque, offset))
>> + continue;
>
> FWIW, I'd even convert this to an inline function rather than a macro.
> Macros are useful when you have to do stuff you can't do otherwise, but
> shouldn't be used much otherwise.

Okay.

>> + if (!invariant_key_less_than_equal_nontarget_offset(state, child,
>> + targetkey, offset))
>> + ereport(CORRUPTION,
>> + (errcode(ERRCODE_INDEX_CORRUPTED),
>
> What's your reasoning for differing sometimes using
> ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE and sometimes ERRCODE_INDEX_CORRUPTED?

The ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE ereports are not errors,
and do not indicate corruption.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2017-02-09 21:37:32 Re: PATCH: two slab-like memory allocators
Previous Message Tom Lane 2017-02-09 21:24:49 Re: [PATCH] Rename pg_switch_xlog to pg_switch_wal