Re: amcheck (B-Tree integrity checking tool)

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Kevin Grittner <kgrittn(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Subject: Re: amcheck (B-Tree integrity checking tool)
Date: 2016-11-23 22:47:09
Message-ID: CAEepm=11qRcCEmjzzQCvit1FEo_8Dm_xZRa2LuWB0iF03QE1bA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Nov 18, 2016 at 12:17 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> So, the Btree you show is supposed to be good? You've left it up to me
> to draw T2, the corrupt version? If so, here is a drawing of T2 for
> the sake of clarity:
>
> Btree = [a|b]
> / \
> / \
> / \
> [a]-------[b]
> | |
> | |
> [c]-------[d]

Actually I meant that at T2 the btree is exactly same as it was at T1,
but the order of the values has changed according to the comparator
(for example, because a collation definition changed when you updated
your operating system), so that the btree is now corrupted. Based on
Goetz Graefe's paper I was wondering if amcheck would be unable to
detect this due to the cousin problem.

>> Now c and b have been switched around. Won't amcheck still pass since
>> we only verify immediate downlinks and siblings? Yet searches for b
>> will take a wrong turn at the root. Do I have this right? I wonder
>> if there is a way to verify that each page's high key < the 'next' key
>> for all ancestors.
>
> I don't think you have it right, because your argument is predicated
> on classic L&Y, not the Postgres variant. The high key is literally
> guaranteed to be *equal* to the initial first value of the right half
> of a split page post-split (at least initially), which is where the
> new downlink for the new right half comes from in turn, so it will be
> exactly equal as well. There is a comment which says all this within
> _bt_insert_parent(), actually.
>
> In general, I think it's pretty bad that we're so loose about
> representing the key space in downlinks compared to classic L&Y,
> because it probably makes index bloat a lot worse in internal pages,
> messing up the key space long term. As long as we're doing that,
> though, we clearly cannot usefully "verify that each page's high key <
> the 'next' key for all ancestors", because we deviate from L&Y in a
> way that makes that check always fail. It's okay that it always fails
> for us (it doesn't actually indicate corruption), because of our
> duplicate-aware index scan logic (change to invariant).
>
> In summary, I'm pretty sure that our "Ki <= v <= Ki+1" thing makes the
> cousin problem a non-issue, because we don't follow downlinks that are
> equal to our scankey -- we go to their immediate left and start
> *there*, in order to support duplicates in leaf pages. Index scans for
> 'b' work for both T1 and T2 in Postgres, because we refuse to follow
> the 'b' downlinks when doing an index scan for 'b' (relevant to T2). I
> would actually like to change things to make the invariant the classic
> L&Y "Ki < v <= Ki+1", to avoid bloat in the internal pages and to make
> suffix truncation in internal pages work.
>
> So, we don't have the cousin problem, but since I wish for us to adopt
> the stricter classic L&Y invariant at some point, you could say that I
> also wished that we had the cousin problem.

Thank you for your patient explanation! Please see my humble attempt
to test this scenario and learn something about btrees in the process,
attached. amcheck does detect the violation of the high key
invariant.

> Confused yet? :-)

Yes. But it's getting better slowly :-)

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
trying-to-break-amcheck.txt text/plain 4.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2016-11-23 23:13:30 Re: patch: function xmltable
Previous Message Thomas Munro 2016-11-23 22:18:01 Re: UNDO and in-place update