Re: amcheck (B-Tree integrity checking tool)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: David Steele <david(at)pgmasters(dot)net>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Subject: Re: amcheck (B-Tree integrity checking tool)
Date: 2016-03-29 03:13:11
Message-ID: CAM3SWZQ-vuss3ttYOHjv86s+zp2oaj6A4CrQDjmXBQfbCGoZsA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On Tue, Mar 22, 2016 at 10:57 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> That's right - I have a small number of feedback items to work
> through. I also determined myself that there could be a very low
> probability race condition when checking the key space across sibling
> pages, and will work to address that.

I attach a V2 revision.

Behavioral changes:

* Fixes very low probability race condition when verifying ordering
across sibling pages with bt_index_check()/AccessShareLock.

Notably, the fix does not involve using the high key from the next
page instead of the first item, despite my recent suggestion that it
would; I changed my mind. In this revision,
bt_right_page_check_scankey() continues to return the first item on
the next/right page (as a scankey), just as in the first version. The
locking considerations here are more complicated than before, and
hopefully correct. I abandoned the high key idea when I realized that
a concurrent page split could happen in the right sibling, making the
high key generally no more safe in the face of concurrent target page
deletion than continuing to use the first data item. Using the first
data item from the right page is better than using the high key item
from the right page verification-wise anyway, so that's good. I found
a new way of locking down and recovering from any race conditions (the
low probability bug in V1 for bt_index_check()/AccessShareLock calls
to the C function bt_right_page_check_scankey()). I won't go into the
gory details here, but I will mention that my new approach is inspired
by how backwards scans work already.

* Adds DEBUG1 traces that indicate the level and page currently being checked.

Advanced users will probably find these traces useful on occasion.
It's nice to be able to see how amcheck walks the B-Tree in practice.

Non-behavioral changes:

* Explains why an ExclusiveLock is needed on the target index by
bt_index_parent_check() (and a ShareLock on its heap relation).

This new explanation talks about why an ExclusiveLock *is* necessary
when checking downlinks against child pages, per Tomas Vondra's
request. (Note that nothing changes about child/downlink verification
itself). Before now, I focused on why they were not necessary, but
Tomas was right to point out that that isn't enough information.

* No longer uses the word "comport" in an error message, per Amit
Langote. Similarly, minor typos fixed, per Tomas.

* Adds a variety of pg_regress-based tests for external sorting. (Via
various CREATE INDEX statements involving a variety of datatypes. Data
is generated pseudo-randomly.)

Notably, the regression tests *never* test external sorting today.

ISTM that amcheck is a convenient, reliable, and portable way of
testing the sorting of collatable types like text, so this is the best
way of testing external sorting. External sorting uses abbreviated
keys for runs, but discards them as runs are written out, and so the
merging of runs does not use abbreviated keys. Abbreviated comparisons
and authoritative comparisons are therefore reliably used within the
same sort. One particular goal here is that something like the recent
strxfrm() debacle might be caught by this testing on some platforms
(if we ever get back to trusting strxfrm() at all, that is). The style
of these new tests is unorthodox, because a variety of collations are
tested, which varies from system to system; we should discuss all
this. The tests can take a minute or two to run on my laptop, but that
could vary significantly. And so, whether or not that needs to be
targeted by something other than "check" and "installcheck" is a
discussion that needs to happen. I imagine a lot of buildfarm animals
have slow storage, but on the other hand I don't think hackers run the
contrib regression tests so often. I prefer to not follow the example
of test_decoding if possible; test_decoding's Makefile only runs tests
when wal_level=logical is expected (that's how the relevant Makefile
targets are scoped), but it's a complicated special case.

Review
------

As already noted after the first bullet point, there are some tricky
considerations on fixing the race when the user calls
bt_index_check(). Reviewers should pay close attention to this; could
I be wrong about that being race-safe even now? This seems to me to be
by far the best place for reviewers to look for problems in this
patch, as it's the only bt_index_check()/AccessShareLock case that
compares anything *across* pages. The measures taken to prevent false
positives described within bt_target_page_check() require expert
review, especially because this is the one case where a faulty
rationale may result in under-protection from false positives, and not
just harmless over-protection.

The rationale for why we can reliably recover from a race described
within the C function bt_right_page_check_scankey() is *complicated*.
I may have failed to make it as simple as possible despite significant
effort. It's hard to describe these things in an accessible way. Maybe
someone can try and understand what I've said there, and let me know
how I might have fallen short. I have used a new term, "cousin page"
in this explanation (cf. sibling page). This seems like useful
terminology that makes these difficult explanations a bit more
accessible.

--
Peter Geoghegan

Attachment Content-Type Size
0001-Add-amcheck-extension-to-contrib.patch text/x-patch 73.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2016-03-29 03:18:17 Re: raw output from copy
Previous Message Andrew Dunstan 2016-03-29 03:12:05 Re: raw output from copy