Re: Improve search for missing parent downlinks in amcheck

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improve search for missing parent downlinks in amcheck
Date: 2020-02-18 22:16:36
Message-ID: CAH2-WzkAGYHkpELXSfpDaj2qEr4pX0pY6VbeGQujM4BYBWPgcg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 18, 2020 at 2:16 AM Alexander Korotkov
<a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> Great, thank you very much!

No problem!

My remarks here are based on
"amcheck-btree-improve-missing-parent-downlinks-check-6.patch". I have
found a false positive corruption report bug in this latest version --
see note below about incomplete page splits.

> > Don't need the "!P_ISLEAF()" here.
>
> Why don't I need. bt_downlink_connectivity_check() checks one level
> down to the target level. But there is no one level down to leaf...

Because offset_is_negative_infinity() checks P_ISLEAF() for you. Maybe
it's better your way, though -- apparently it's clearer.

> > > Alternatively
> > > + * we might find child with high key while traversing from
> > > + * previous downlink to current one. Then matching key resides
> > > + * the same offset number as current downlink.
> > > + */
> >
> > Not sure what "traversing from previous downlink to current one" means at all.
>
> I've rephrased this comment, please check.

> Agree, replaced _bt_compare() with bt_pivot_tuple_identical(). It
> becomes even simpler now, thanks!

There was actually an even better reason to invent
bt_pivot_tuple_identical(): a call to _bt_compare() in amcheck needs
to do something like the extra steps that you see in routines like
invariant_l_offset(). _bt_compare() will return 0 when the insertion
scankey has a prefix of scankey/column values that are equal, even
though there may be additional columns in the index tuple that are not
compared. So, you could have a truncated multi-column high key that is
"equal" to pivot tuple in parent that is actually to the right in the
key space. This blind spot would often occur with low cardinality
indexes, where we often have something like this in pivot tuples on
internal pages:

'foo, -inf'
'foo, (1,24)'
'food, -inf'. <-- This pivot tuple's downlink points to the final leaf
page that's filled with duplicates of the value 'foo'
'food, (3,19)' <-- This pivot tuple's downlink points to the *first*
leaf page that's filled with duplicates of the value 'food'
...

The situation is really common in low cardinality indexes because
nbtsplitloc.c hates splitting a leaf page between two duplicates -- it
is avoided whenever possible. You reliably get a '-inf' value for the
TID in the first pivot tuple for the duplicate, followed by a real
heap TID for later pivot tuples for pages with the same duplicate
value.

(Anyway, it's not important now.)

> > I think bt_downlink_check() and bt_downlink_connectivity_check()
> > should be renamed to something broader. In my mind, downlink is
> > basically a block number. We have been sloppy about using the term
> > downlink when we really mean "pivot tuple with a downlink" -- I am
> > guilty of this myself. But it seems more important, now that you have
> > the new high key check.
>
> Hmm... Names are hard for me. I didn't do any renaming for now. What
> about this?
> bt_downlink_check() => bt_child_check()
> bt_downlink_connectivity_check() => bt_child_connectivity_highkey_check()

I suggest:

bt_downlink_check() => bt_child_check()
bt_downlink_connectivity_check() => bt_child_highkey_check()

While bt_downlink_connectivity_check() moves right on the target's
child level, this isn't the common case. Moving right like that
doesn't need to be suggested by the name of the function.

Most of the time, we just check the high key -- right?

> I've revised finding the matching pivot key for high key. Now, code
> assumes we should always find a matching pivot key. It could use both
> target high key or left sibling high key (which is memorized as "low
> key").
>
> I've checked this on normal indexes, but I didn't try to exercise this
> with broken indexes. I would appreciate if you do.

I can confirm that checking the high key in target's child page
against the high key in target (when appropriate) removes the "cousin
page verification blind spot" that I noticed in the last version, as
expected. Great!

* You should say "target page lsn" here instead:

pg(at)tpce:5432 [19852]=# select bt_index_parent_check(:'idxrelation',true, true);
ERROR: mismatch between parent key and child high key in index "i_holding2"
DETAIL: Target block=1570 child block=1690 parent page lsn=0/0.
Time: 12.509 ms

* Maybe say "Move to the right on the child level" in a comment above
the bt_downlink_connectivity_check() "while (true)" loop here:

> +
> + while (true)
> + {
> + /*
> + * Did we traverse the whole tree level and this is check for pages to
> + * the right of rightmost downlink?
> + */

* If you are going to save a low key for the target page in memory,
then you only need to do so for "state->readonly"/parent verification.

* You should s/lokey/lowkey/ -- I prefer the spelling "lowkey" or "low
key". This is a term that nbtsort.c now uses, in case you didn't know.

* The reason for saving a low key for each target page is very
unclear. Can we fix this?

The low key hardly ever gets used -- I can comment out the code that
uses a low key within bt_downlink_connectivity_check(), and indexes
that I tested appear fine. So I imagine this is for incomplete split
cases -- right?

I tested incomplete split cases using this simple hack:

diff --git a/src/backend/access/nbtree/nbtinsert.c
b/src/backend/access/nbtree/nbtinsert.c
index 4e5849ab8e..5811338584 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1821,7 +1821,7 @@ _bt_insert_parent(Relation rel,
*/
_bt_relbuf(rel, rbuf);

- if (pbuf == InvalidBuffer)
+ if (random() <= (MAX_RANDOM_VALUE / 100))
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("failed to re-find parent key in
index \"%s\" for split pages %u/%u",

Now, 1% of all page splits will fail after the first phase finishes,
but before the second phase begins/finishes. The incomplete split flag
will be found set by other, future inserts, though, at which point the
split will be finished (unless we're unlucky again).

An INSERT that inserts many rows to bound to fail with this hack
applied, but the INSERT shouldn't leave the index in a state that
looks like corruption to amcheck. I can see false positive reports of
corruption like this, though.

Consider the following sample session. The session inserts data from
one table into another table with an equivalent schema. As you can
see, I have to get my INSERT to fail three times before I see the
problem:

pg(at)tpce:5432 [13026]=# truncate holding2;
TRUNCATE TABLE
pg(at)tpce:5432 [13026]=# insert into holding2 select * from holding;
ERROR: failed to re-find parent key in index "pk_holding2" for split
pages 83/158
pg(at)tpce:5432 [13026]=# select bt_index_parent_check('i_holding2',true, true);
DEBUG: verifying level 1 (true root level)
DEBUG: verifying 200 items on internal block 3
DEBUG: verifying level 0 (leaf level)
DEBUG: verifying 262 items on leaf block 1
DEBUG: verifying 258 items on leaf block 2
*** SNIP ***
DEBUG: verifying 123 items on leaf block 201
DEBUG: verifying that tuples from index "i_holding2" are present in "holding2"
DEBUG: finished verifying presence of 0 tuples from table "holding2"
with bitset 4.83% set
┌───────────────────────┐
│ bt_index_parent_check │ <-- No false positive yet
├───────────────────────┤
│ │
└───────────────────────┘
(1 row)

pg(at)tpce:5432 [13026]=# insert into holding2 select * from holding;
DEBUG: finishing incomplete split of 83/158
ERROR: failed to re-find parent key in index "i_holding2" for split
pages 435/436
pg(at)tpce:5432 [13026]=# select bt_index_parent_check(:'idxrelation',true, true);
DEBUG: verifying level 2 (true root level)
DEBUG: verifying 2 items on internal block 303
*** SNIP ***
DEBUG: verifying 74 items on leaf block 436
DEBUG: verifying that tuples from index "i_holding2" are present in "holding2"
DEBUG: finished verifying presence of 0 tuples from table "holding2"
with bitset 9.97% set
┌───────────────────────┐
│ bt_index_parent_check │ <-- No false positive yet
├───────────────────────┤
│ │
└───────────────────────┘
(1 row)

pg(at)tpce:5432 [13026]=# insert into holding2 select * from holding;
ERROR: failed to re-find parent key in index "i_holding2" for split
pages 331/577
pg(at)tpce:5432 [13026]=# select bt_index_parent_check(:'idxrelation',true, true);
DEBUG: verifying level 2 (true root level)
DEBUG: verifying 3 items on internal block 303
DEBUG: verifying level 1
DEBUG: verifying 151 items on internal block 3
DEBUG: verifying 189 items on internal block 521
DEBUG: verifying 233 items on internal block 302
WARNING: 577 rightsplit
ERROR: leaf index block lacks downlink in index "i_holding2". <--
False positive "corruption"
DETAIL: Block=127 page lsn=0/0.

Notice I added a custom WARNING here, and we get the false positive
error just after the point that the WARNING is seen. This WARNING is
triggered by the temporary instrumentation change that I added to
amcheck, and can be seen in some cases that don't have an accompanying
false positive report of corruption:

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index d6cb10e263..dcfee9e4fa 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -1654,6 +1654,8 @@ bt_downlink_connectivity_check(BtreeCheckState *state,
errmsg("circular link chain found in block %u of
index \"%s\"",
blkno, RelationGetRelationName(state->rel))));

+ if (rightsplit)
+ elog(WARNING, "%u rightsplit", blkno);
if (!first && !P_IGNORE(opaque))
{
/* blkno probably has missing parent downlink */
*** END DIFF ***

Anyway, I didn't take the time to debug this today, but I suspect that
the problem is that the lowkey thing is kind of brittle. If you can't
figure it out, let me know and I'll help with it.

* Can we reverse the order here, so that the common case (i.e. the
offset_is_negative_infinity() case) comes first?:

> + if (offset_is_negative_infinity(topaque, pivotkey_offset))
> + {
> + /*
> + * We're going to try match child high key to "negative
> + * infinity key". That means we should match to high key of
> + * left sibling of target page.
> + */
*** SNIP ***

> + }
> + else
> + {
*** SNIP ***

> + }

* Can the offset_is_negative_infinity() branch explain what's going on
at a much higher level than this?

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2020-02-18 22:18:50 Re: Memory-Bounded Hash Aggregation
Previous Message Thomas Munro 2020-02-18 20:54:51 Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()