Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2019-03-16 16:17:21
Message-ID: 55d61cee-a13b-5623-737b-5d06774a9c3a@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 16/03/2019 10:51, Peter Geoghegan wrote:
> On Sat, Mar 16, 2019 at 1:44 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>> It would be nice if you could take a look at the amcheck "relocate"
>>> patch
>> When I started looking at this, I thought that "relocate" means "move".
>> So I thought that the new mode would actually move tuples, i.e. that it
>> would modify the index. That sounded horrible. Of course, it doesn't
>> actually do that. It just checks that each tuple can be re-found, or
>> "relocated", by descending the tree from the root. I'd suggest changing
>> the language to avoid that confusion.
>
> Okay. What do you suggest? :-)

Hmm. "re-find", maybe? We use that term in a few error messages already,
to mean something similar.

>> It seems like a nice way to catch all kinds of index corruption issues.
>> Although, we already check that the tuples are in order within the page.
>> Is it really necessary to traverse the tree for every tuple, as well?
>> Maybe do it just for the first and last item?
>
> It's mainly intended as a developer option. I want it to be possible
> to detect any form of corruption, however unlikely. It's an
> adversarial mindset that will at least make me less nervous about the
> patch.

Fair enough.

At first, I thought this would be horrendously expensive, but thinking
about it a bit more, nearby tuples will always follow the same path
through the upper nodes, so it'll all be cached. So maybe it's not quite
so bad.

>> I don't understand this. Can you give an example of this kind of
>> inconsistency?
>
> The commit message gives an example, but I suggest trying it out for
> yourself. Corrupt the least significant key byte of a root page of a
> B-Tree using pg_hexedit. Say it's an index on a text column, then
> you'd corrupt the last byte to be something slightly wrong. Then, the
> only way to catch it is with "relocate" verification. You'll only miss
> a few tuples on a cousin page at the leaf level (those on either side
> of the high key that the corrupted separator key in the root was
> originally copied from).
>
> The regular checks won't catch this, because the keys are similar
> enough one level down. The "minus infinity" item is a kind of a blind
> spot, because we cannot do a parent check of its children, because we
> don't have the key (it's truncated when the item becomes a right page
> minus infinity item, during an internal page split).

Hmm. So, the initial situation would be something like this:

+-----------+
| 1: root |
| |
| -inf -> 2 |
| 20 -> 3 |
| |
+-----------+

+-------------+ +-------------+
| 2: internal | | 3: internal |
| | | |
| -inf -> 4 | | -inf -> 6 |
| 10 -> 5 | | 30 -> 7 |
| | | |
| hi: 20 | | |
+-------------+ +-------------+

+---------+ +---------+ +---------+ +---------+
| 4: leaf | | 5: leaf | | 6: leaf | | 7: leaf |
| | | | | | | |
| 1 | | 11 | | 21 | | 31 |
| 5 | | 15 | | 25 | | 35 |
| 9 | | 19 | | 29 | | 39 |
| | | | | | | |
| hi: 10 | | hi: 20 | | hi: 30 | | |
+---------+ +---------+ +---------+ +---------+

Then, a cosmic ray changes the 20 on the root page to 18. That causes
the the leaf tuple 19 to become non-re-findable; if you descend the for
19, you'll incorrectly land on page 6. But it also causes the high key
on page 2 to be different from the downlink on the root page. Wouldn't
the existing checks catch this?

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2019-03-16 16:20:01 Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
Previous Message Dmitry Dolgov 2019-03-16 16:14:20 Re: Index Skip Scan