| From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> | 
|---|---|
| To: | Peter Geoghegan <pg(at)bowt(dot)ie> | 
| Cc: | Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru> | 
| Subject: | Re: Making all nbtree entries unique by having heap TIDs participate in comparisons | 
| Date: | 2019-03-16 16:17:21 | 
| Message-ID: | 55d61cee-a13b-5623-737b-5d06774a9c3a@iki.fi | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
On 16/03/2019 10:51, Peter Geoghegan wrote:
> On Sat, Mar 16, 2019 at 1:44 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>> It would be nice if you could take a look at the amcheck "relocate"
>>> patch
>> When I started looking at this, I thought that "relocate" means "move".
>> So I thought that the new mode would actually move tuples, i.e. that it
>> would modify the index. That sounded horrible. Of course, it doesn't
>> actually do that. It just checks that each tuple can be re-found, or
>> "relocated", by descending the tree from the root. I'd suggest changing
>> the language to avoid that confusion.
> 
> Okay. What do you suggest? :-)
Hmm. "re-find", maybe? We use that term in a few error messages already, 
to mean something similar.
>> It seems like a nice way to catch all kinds of index corruption issues.
>> Although, we already check that the tuples are in order within the page.
>> Is it really necessary to traverse the tree for every tuple, as well?
>> Maybe do it just for the first and last item?
> 
> It's mainly intended as a developer option. I want it to be possible
> to detect any form of corruption, however unlikely. It's an
> adversarial mindset that will at least make me less nervous about the
> patch.
Fair enough.
At first, I thought this would be horrendously expensive, but thinking 
about it a bit more, nearby tuples will always follow the same path 
through the upper nodes, so it'll all be cached. So maybe it's not quite 
so bad.
>> I don't understand this. Can you give an example of this kind of
>> inconsistency?
> 
> The commit message gives an example, but I suggest trying it out for
> yourself. Corrupt the least significant key byte of a root page of a
> B-Tree using pg_hexedit. Say it's an index on a text column, then
> you'd corrupt the last byte to be something slightly wrong. Then, the
> only way to catch it is with "relocate" verification. You'll only miss
> a few tuples on a cousin page at the leaf level (those on either side
> of the high key that the corrupted separator key in the root was
> originally copied from).
>
> The regular checks won't catch this, because the keys are similar
> enough one level down. The "minus infinity" item is a kind of a blind
> spot, because we cannot do a parent check of its children, because we
> don't have the key (it's truncated when the item becomes a right page
> minus infinity item, during an internal page split).
Hmm. So, the initial situation would be something like this:
                  +-----------+
                  | 1: root   |
                  |           |
                  | -inf -> 2 |
                  | 20   -> 3 |
                  |           |
                  +-----------+
         +-------------+ +-------------+
         | 2: internal | | 3: internal |
         |             | |             |
         | -inf -> 4   | | -inf -> 6   |
         | 10   -> 5   | | 30   -> 7   |
         |             | |             |
         | hi: 20      | |             |
         +-------------+ +-------------+
+---------+ +---------+ +---------+ +---------+
| 4: leaf | | 5: leaf | | 6: leaf | | 7: leaf |
|         | |         | |         | |         |
| 1       | | 11      | | 21      | | 31      |
| 5       | | 15      | | 25      | | 35      |
| 9       | | 19      | | 29      | | 39      |
|         | |         | |         | |         |
| hi: 10  | | hi: 20  | | hi: 30  | |         |
+---------+ +---------+ +---------+ +---------+
Then, a cosmic ray changes the 20 on the root page to 18. That causes 
the the leaf tuple 19 to become non-re-findable; if you descend the for 
19, you'll incorrectly land on page 6. But it also causes the high key 
on page 2 to be different from the downlink on the root page. Wouldn't 
the existing checks catch this?
- Heikki
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Tom Lane | 2019-03-16 16:20:01 | Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view? | 
| Previous Message | Dmitry Dolgov | 2019-03-16 16:14:20 | Re: Index Skip Scan |