Re: Improve search for missing parent downlinks in amcheck

From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improve search for missing parent downlinks in amcheck
Date: 2020-02-18 10:15:47
Message-ID: CAPpHfdt6DdEDcub6378iiLiQhoYhaxZu+neczMxH=bYcED5T5w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, Peter!

On Fri, Jan 24, 2020 at 4:31 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Wed, Jan 22, 2020 at 6:41 PM Alexander Korotkov
> <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> > Rebased patch is attached. Sorry for so huge delay.
>
> I really like this patch. Your interest in amcheck is something that
> makes me feel good about having put so much work into it myself.
>
> Here are some review comments:

Great, thank you very much!

> > + /*
> > + * Rightlink and incomplete split flag of previous block referenced by
> > + * downlink.
> > + */
> > + BlockNumber prevrightlink;
> > + bool previncompletesplit;
> > +
>
> What downlink? What does this mean? Do you mean the most recently
> followed rightlink on the current level, or InvalidBlockNumber if
> target page is the leftmost page on the current level on the scan?
>
> (Thinks some more...)
>
> Actually, these two new fields track the state *one level down* from
> the target page level when !readonly (unless target page is on the
> leaf level). Right? Comments should be explicit about this. The
> current comments about downlinks isn't clear.

I agree. I've used very vague terms in comments. Revised now.

> > if (offset_is_negative_infinity(topaque, offset))
> > + {
> > + /*
> > + * Initialize downlink connectivity check if needed.
> > + */
> > + if (!P_ISLEAF(topaque) && state->readonly)
> > + {
> > + bt_downlink_connectivity_check(state,
> > + offset,
> > + NULL,
> > + topaque->btpo.level);
> > + }
> > continue;
> > + }
>
> Don't need the "!P_ISLEAF()" here.

Why don't I need. bt_downlink_connectivity_check() checks one level
down to the target level. But there is no one level down to leaf...

> Also, you should say something like
> "we need to call this here because the usual callsite in
> bt_downlink_check() won't be reached".

Sure, fixed.

> > /*
> > - * * Check if page has a downlink in parent *
> > - *
> > - * This can only be checked in heapallindexed + readonly case.
> > + * If we traversed the whole level to the rightmost page, there might be
> > + * missing downlinks for the pages to the right of rightmost downlink.
> > + * Check for them.
> > */
>
> You mean "to the right of the child page pointed to by our rightmost downlink"?

Yep, fixed.

> I think that the final bt_downlink_connectivity_check() call within
> bt_target_page_check() should make it clear that it is kind of special
> compared to the other two calls.

Yes, this is fixed too.

> > +/*
> > + * Check connectivity of downlinks. Traverse rightlinks from previous downlink
> > + * to the current one. Check that there are no intermediate pages with missing
> > + * downlinks.
> > + *
> > + * If 'loaded_page' is given, it's assumed to be contents of downlink
> > + * referenced by 'downlinkoffnum'.
> > + */
>
> Say "assumed to be the page pointed to by the downlink", perhaps?

Yes, fixed.

> > +static void
> > +bt_downlink_connectivity_check(BtreeCheckState *state,
> > + OffsetNumber downlinkoffnum,
> > + Page loaded_page,
> > + uint32 parent_level)
> > +{
>
> In amcheck, we always have a current target page. Every page gets to
> be the target exactly once, though sometimes other subsidiary pages
> are accessed. We try to blame the target page, even with checks that
> are technically against its child/sibling/whatever. The target page is
> always our conceptual point of reference. Sometimes this is a bit
> artificial, but it's still worth doing consistently. So I think you
> should change these argument names with that in mind (see below).

Yes, the arguments were changes as you proposed.

> > + /*
> > + * If we visit page with high key, check that it should be equal to
> > + * the target key next to corresponding downlink.
> > + */
>
> I suggest "...check that it is equal to the target key..."

Agree, fixed.

> > + /*
> > + * There might be two situations when we examine high key. If
> > + * current child page is referenced by given downlink, we should
> > + * look to the next offset number for matching key.
>
> You mean "the next offset number for the matching key from the target
> page"? I find it much easier to keep this stuff in my head if
> everything is defined in terms of its relationship with the current
> target page. For example, bt_downlink_connectivity_check()'s
> "parent_level" argument should be called "target_level" instead, while
> its "loaded_page" should be called "loaded_child". Maybe
> "downlinkoffnum" should be "target_downlinkoffnum". And downlinkoffnum
> should definitely be explained in comments at the top of
> bt_downlink_connectivity_check() (e.g., say what it means when it is
> InvalidOffsetNumber).

Renamed as you proposed.

> > Alternatively
> > + * we might find child with high key while traversing from
> > + * previous downlink to current one. Then matching key resides
> > + * the same offset number as current downlink.
> > + */
>
> Not sure what "traversing from previous downlink to current one" means at all.

I've rephrased this comment, please check.

> > + if (!offset_is_negative_infinity(topaque, pivotkey_offset) &&
> > + pivotkey_offset <= PageGetMaxOffsetNumber(state->target))
> > + {
> > + uint32 cmp = _bt_compare(state->rel,
> > + skey,
> > + state->target,
> > + pivotkey_offset);
>
> There is no need to bother with a _bt_compare() here. Why not just use
> memcmp() with a pointer to itup->t_tid.ip_posid (i.e. memcmp() that
> skips the block number)? I think that it is better to expect the keys
> to be *identical* among pivot tuples, including within tuple alignment
> padding (only the downlink block number can be different here). If
> non-pivot tuples were involved then you couldn't do it this way, but
> they're never involved, so it makes sense. A memcmp() will be faster,
> obviously. More importantly, it has the advantage of not relying on
> opclass infrastructure in any way. It might be worth adding an
> internal verify_nbtree.c static helper function to do the memcmp() for
> you -- bt_pivot_tuple_identical(), or something like that.

Agree, replaced _bt_compare() with bt_pivot_tuple_identical(). It
becomes even simpler now, thanks!

> I think bt_downlink_check() and bt_downlink_connectivity_check()
> should be renamed to something broader. In my mind, downlink is
> basically a block number. We have been sloppy about using the term
> downlink when we really mean "pivot tuple with a downlink" -- I am
> guilty of this myself. But it seems more important, now that you have
> the new high key check.

Hmm... Names are hard for me. I didn't do any renaming for now. What
about this?
bt_downlink_check() => bt_child_check()
bt_downlink_connectivity_check() => bt_child_connectivity_highkey_check()

> I particularly don't like the way you sometimes say "downlink" when
> you mean "child page". You do that in this error message:
>
> > + (errcode(ERRCODE_INDEX_CORRUPTED),
> > + errmsg("block found while traversing rightlinks from downlink of index \"%s\" has invalid level",
> > + RelationGetRelationName(state->rel)),

Agree. Error message has been changed.

> Typo here:
>
> > + /*
> > + * If no previos rightlink is memorized, get it from current downlink for
> > + * future usage.
> > + */
>
> You mean "previous". Also, I think that you should say "memorized for
> current level just below target page's level".

Yep, fixed.

> > * within bt_check_level_from_leftmost() won't reach the page either,
> > * since the leaf's live siblings should have their sibling links updated
> > - * to bypass the deletion target page when it is marked fully dead.)
> > + * to bypass the deletion page under check when it is marked fully dead.)
> > *
>
> This change seems wrong or unnecessary -- "deletion target" means
> "page undergoing deletion" (not necessarily marked P_ISDELETED() just
> yet), and has nothing to do with the amcheck target. You can change
> this if you want, but I don't get it.

Seems like random oversight. Change removed.

> I tested this by using pg_hexedit to corrupt the least significant
> byte of a text key in the root page:
>
> pg(at)tpce:5432 [32610]=# select bt_index_parent_check('pk_holding');
> DEBUG: verifying level 2 (true root level)
> DEBUG: verifying 9 items on internal block 290
> DEBUG: verifying level 1
> DEBUG: verifying 285 items on internal block 3
> ERROR: mismatch between parent key and child high key index "pk_holding"
> DETAIL: Parent block=3 child block=9 parent page lsn=998/EFA21550.
>
> Happy to see that this works, even though this is one of the subtlest
> possible forms of index corruption. Previously, we could sometimes
> catch this with "rootdescend" verification, but only if there were
> *current* items that a scan couldn't find on lower levels (often just
> the leaf level). But now it doesn't matter -- we'll always detect it.
> (I think.)

Thank you for testing!

> Shouldn't this error message read '...in index "pk_holding"'? You
> missed the "in". Also, why not have the DETAIL message call the
> "Parent block" the target block?

Yep, fixed.

> I think that bt_downlink_connectivity_check() should have some
> high-level comments about what it's supposed to do. Perhaps an example
> is the best way to explain the concepts. Maybe say something about a
> three level B-Tree. Each of the separator keys in the grandparent/root
> page should also appear as high keys at the parent level. Each of the
> separator keys in the parent level should also appear as high keys on
> the leaf level, including the separators from the parent level high
> keys. Since each separator defines which subtrees are <= and > of the
> separator, there must be an identical seam of separators (in high
> keys) on lower levels. bt_downlink_connectivity_check() verifies that
> separator keys agree across a single level, which verifies the
> integrity of the whole tree.

I've revised finding the matching pivot key for high key. Now, code
assumes we should always find a matching pivot key. It could use both
target high key or left sibling high key (which is memorized as "low
key").

I've checked this on normal indexes, but I didn't try to exercise this
with broken indexes. I would appreciate if you do.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
amcheck-btree-improve-missing-parent-downlinks-check-5.patch application/octet-stream 27.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2020-02-18 10:17:18 Re: Improve search for missing parent downlinks in amcheck
Previous Message Amit Langote 2020-02-18 09:56:23 Re: plan cache overhead on plpgsql expression