Re: amcheck (B-Tree integrity checking tool)

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <kgrittn(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Subject: Re: amcheck (B-Tree integrity checking tool)
Date: 2017-02-09 21:57:08
Message-ID: 20170209215708.i76xtqgtayzmsihv@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2017-02-09 13:29:42 -0800, Peter Geoghegan wrote:
> > I'd really like to have something like relation_check(), index_check()
> > which calls the correct functions for the relevan index types (and
> > possibly warns about !nbtree indexes not being checked?).
>
> I feel that that's something for an in-core facility that exposes this
> through DDL. One good reason to not do that just yet is that I'm not
> completely comfortable with what a uniform interface looks like here.
> And frankly, I really just want to get this out of the way, to prove
> the general concept of verification tooling. amcheck v1 is a good
> start, but much work remains. I think that no one will take the idea
> completely seriously until it is usable as a core extension. Let's get
> the ball rolling.

I'm not convinced at all by that argument. If we want something in
core, having this in contrib doesn't help much. Nor do I see much
point in explicit syntax.

> >> +#define CHECK_SUPERUSER() { \
> >> + if (!superuser()) \
> >> + ereport(ERROR, \
> >> + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
> >> + (errmsg("must be superuser to use verification functions")))); }
> >
> > Please make this either a function or just copy - the macro doesn't buy
> > us anything except weirder looking syntax.
>
> It's actually from pageinspect, but okay.

We have a lot of dodgy stuff in places ;)

> >> +PG_FUNCTION_INFO_V1(bt_index_check);
> >> +PG_FUNCTION_INFO_V1(bt_index_parent_check);
> >
> > I think it'd be good to avoid adding a lot of functions for each
> > additional type of check we can think of. I think adding a named 'bool
> > check_parents = false' argument or such would be better. Then we can
> > simply add more and more checks subsequently, and we can have
> > 'bool all_checks = false' so people can trivially check everything,
> > instead of having to add new checks in each new function.
>
> I strongly disagree. I would agree if the lock strength were the same
> in both cases, but it isn't. Varying the strength of heavyweight lock
> taken based on an argument to a function is a massive foot-gun IMV.

Meh. I don't think that's a serious problem, nor is without precedent
(say same toplevel DDL with differing lock levels depending on
options). Nor do the function names confer that any more efficiently
than parameters.

> I do think that that's the logical way to split out functionality. I
> don't think that there is likely to be a problem with adding stuff in
> the future. It'll either be something that we can do in passing
> anyway, in which case we can just add it to the existing checks
> harmlessly. Or, it will be something like a tool that verifies the
> heap is consistent with a given index, which is vastly more expensive
> in terms of CPU and I/O costs, and yet is probably no worse than
> bt_index_check() in terms of lock strength (AccessShareLock).

But you quite possibly want to do all three in one pass - and that'll be
ugly with your API. With your API I'll not be able to do the parent
check and the heap check at once. Lest we add fourth function that does
all of it.

> My thinking here is that users really ought to use bt_index_check() in
> almost all cases, not bt_index_parent_check(), which is more of a tool
> for hackers that are developing new B-Tree features. The distinction
> is blurry, in fact, but the lock strength requirements of
> bt_index_parent_check() make it vastly less useful.

A quite reasonable patter is to do this verification on standbys that
you can either bring up separately and/or just temporarily pause. Or
just backups before declaring them valid.

> >> +/*
> >> + * bt_index_check(index regclass)
> >> + *
> >> + * Verify integrity of B-Tree index.
> >> + *
> >> + * Only acquires AccessShareLock on index relation. Does not consider
> >> + * invariants that exist between parent/child pages.
> >> + */
>
> > I'm *VERY* strongly against releasing locks on user relations early.
>
> Why? pageinspect does this.

Again, some parts of the code doing something bad isn't a good argument
for doing again. Releasing locks early is a bad pattern, because backend
code isn't generally safe against it, we have special code-paths for
catalog tables to support it for those.

> >> + /*
> >> + * We must lock table before index to avoid deadlocks. However, if the
> >> + * passed indrelid isn't an index then IndexGetRelation() will fail.
> >> + * Rather than emitting a not-very-helpful error message, postpone
> >> + * complaining, expecting that the is-it-an-index test below will fail.
> >> + */
> >> + heapid = IndexGetRelation(indrelid, true);
> >> + if (OidIsValid(heapid))
> >> + heaprel = heap_open(heapid, ShareLock);
> >> + else
> >> + heaprel = NULL;
> >> +
> >> + /*
> >> + * Open the target index relations separately (like relation_openrv(), but
> >> + * with heap relation locked first to prevent deadlocking). In hot standby
> >> + * mode this will raise an error.
> >> + */
> >> + indrel = index_open(indrelid, ExclusiveLock);
> >
> > So yea, this is what I was complaining about re locking / index & table
> > relation IIRC. Note you're also deadlock prone by violating the rule
> > about always locking the relation first.
>
> This is based on brin_summarize_new_values().
> I'm not doing a recheck,
> though, because I don't actually need the heap relation for anything
> (I suppose I imagined that that was okay, possibly failing to consider
> some subtlety beyond "consistency is good").

We iirc rely on indexes not being locked without the heap also being
locked... I see no reason to be careful here.

The deadlock part of my complaint is bogus.

> The comments here are almost a direct lift -- is
> brin_summarize_new_values() wrong to itself claim to be
> deadlock-safe?.

No, that seems ok.

> (BTW, brin_summarize_new_values(), an SQL-callable function, also
> releases heavyweight locks early.)

And that's afaics actually broken and a bug.

> >> + if (!IndexIsValid(rel->rd_index))
> >> + ereport(ERROR,
> >> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> >> + errmsg("cannot check index \"%s\"",
> >> + RelationGetRelationName(rel)),
> >> + errdetail("Index is not valid")));
> >
> > I'm perfectly fine with not changing that, but IIUC it'd be fine to
> > weaken this to IndexIsLive() - right?
>
> I believe so, but prefer to be more restrictive in the absence of a
> good reason not to be.

Seems useful to be able to verify an index during CONCURRENTLY, after
the initial build, but before the update pass, it's not like we didn't
have bugs in it :(

> >> +static void
> >> +bt_check_every_level(Relation rel, bool exclusivelock)
> >
> > Hm, is every level accurate - this is more like checking every
> > "reachable" page, right?
>
> Every level should have at least one page that is reachable (not
> half-dead or fully dead), unless perhaps VACUUM marks the last leaf
> page in the entire index as half-dead and we land on that. I think
> that that's such a rare case that it shouldn't alter how we name this
> function at all, which really does do what it says almost always. Note
> that there are pretty severe restrictions on when vacuum can do
> page-level deletion of an internal page. This is what leads to index
> bloat with sparse deletion patterns (one problem I want to address by
> adding suffix truncation to internal pages at a later date).

I think my point was different - it's not that we're not checking every
level, it's that we're checking *more* than just "levels".

> > What's your reasoning for differing sometimes using
> > ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE and sometimes ERRCODE_INDEX_CORRUPTED?
>
> The ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE ereports are not errors,
> and do not indicate corruption.

Don't think errcodes make much sense for debug stuff, nor do I think
that ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE would be a good fit for
something that's not an error.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2017-02-09 22:09:23 Re: Parallel tuplesort (for parallel B-Tree index creation)
Previous Message Erik Rijkers 2017-02-09 21:43:23 Re: \if, \elseif, \else, \endif (was Re: PSQL commands: \quit_if, \quit_unless)