Re: Corrupted btree index on HEAD because of covering indexes

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Postgres hackers <pgsql-hackers(at)postgresql(dot)org>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Subject: Re: Corrupted btree index on HEAD because of covering indexes
Date: 2018-04-19 18:39:05
Message-ID: CAH2-Wzm5brT8oHnWKda-KWV8LeA+u-B3jCkVjZddqW5BqO0=MA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Apr 19, 2018 at 9:42 AM, Teodor Sigaev <teodor(at)sigaev(dot)ru> wrote:
> Interesting, contrib/amcheck doesn't find any error in index. Seems, it's
> subject for further improvement.

I think that you're right that this should be detectable by
bt_index_parent_check(). I have an idea about how we can make this
happen with low overhead:

Iff we're readonly (that is, called through bt_index_parent_check()),
use a bloom filter to fingerprint downlink block numbers on each
non-leaf level, starting from the root. On the next level down, check
that we encountered a downlink at the parent level for non-P_IGNORE()
pages. If we didn't, throw an error. This should only need a small
Bloom filter to work well. I suppose we could only do it when
"heapallindexed = true" was specified to bt_index_parent_check(), and
size the Bloom filter based on the same principle as heapallindexed
verification.

This check is correct because the downlink in the parent is atomically
removed when the page (that the downlink points to) is marked
half-dead by VACUUM (and because there cannot be a concurrent VACUUM,
since we have a ShareLock on the relation). If we find a
non-P_IGNORE() page whose block number was not fingerprinted one level
up, then something must be wrong. (We should also throw an error
within bt_downlink_check() if the page turns out to be P_IGNORE(), to
make our testing "symmetrical".)

The really scary thing about corruption like this is not that it leads
to wrong answers to queries. It's that it *doesn't* lead to wrong
answers to queries (I mentioned this issue to Alexander a few weeks
ago, actually). This is true because we'll move right one level down,
maybe more than once, because it just looks like there was a
concurrent page split to index scans.

(There is actually another subtlety about this new
bt_index_parent_check() check, which is legitimate incomplete page
splits that we detect using P_INCOMPLETE_SPLIT(). The proposed check
still seems totally doable, though.)

I would like to go and implement this check now, and if all goes well
I may propose that you commit it to the master branch for v11. I don't
see this as a new feature. I see it as essential testing
infrastructure. What do you think about adding this new check soon?

> Nevertheless, seems, I found. In _bt_mark_page_halfdead() we use truncated
> high key IndexTuple as a storage of blocknumber of top parent to remove. And
> sets BTreeTupleSetNAtts(&trunctuple, 0) - it's stored in ip_posid.
>
> But some later, in _bt_unlink_halfdead_page() we check ItemPointer high key
> with ItemPointerIsValid macro - and it returns false, because offset is
> actually InvalidOffsetNumber - i.e. 0 which was set by BTreeTupleSetNAtts.
> And some wrong decisions are follows, I didn't look at that.

I'm not at all surprised. I strongly suspected that it was some simple
issue with the representation, since the INCLUDE patch didn't actually
change the page deletion algorithm in any way.

> Trivial and naive fix is attached, but for me it looks a bit annoing that we
> store pointer (leafhikey) somewhere inside unlocked page.

Well, it has worked that way for a long time. No reason to change it now.

I also think that we could have better conventional regression test
coverage here.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavan Deolasee 2018-04-19 18:39:58 Re: VM map freeze corruption
Previous Message Stephen Frost 2018-04-19 18:15:48 Re: Built-in connection pooling