Re: amcheck (B-Tree integrity checking tool)

From: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)heroku(dot)com>, David Steele <david(at)pgmasters(dot)net>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: amcheck (B-Tree integrity checking tool)
Date: 2016-04-08 13:05:06
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

29.03.2016 06:13, Peter Geoghegan:
> On Tue, Mar 22, 2016 at 10:57 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> That's right - I have a small number of feedback items to work
>> through. I also determined myself that there could be a very low
>> probability race condition when checking the key space across sibling
>> pages, and will work to address that.
> I attach a V2 revision.
> Behavioral changes:
> * Fixes very low probability race condition when verifying ordering
> across sibling pages with bt_index_check()/AccessShareLock.
> Notably, the fix does not involve using the high key from the next
> page instead of the first item, despite my recent suggestion that it
> would; I changed my mind. In this revision,
> bt_right_page_check_scankey() continues to return the first item on
> the next/right page (as a scankey), just as in the first version. The
> locking considerations here are more complicated than before, and
> hopefully correct. I abandoned the high key idea when I realized that
> a concurrent page split could happen in the right sibling, making the
> high key generally no more safe in the face of concurrent target page
> deletion than continuing to use the first data item. Using the first
> data item from the right page is better than using the high key item
> from the right page verification-wise anyway, so that's good. I found
> a new way of locking down and recovering from any race conditions (the
> low probability bug in V1 for bt_index_check()/AccessShareLock calls
> to the C function bt_right_page_check_scankey()). I won't go into the
> gory details here, but I will mention that my new approach is inspired
> by how backwards scans work already.
> * Adds DEBUG1 traces that indicate the level and page currently being checked.
> Advanced users will probably find these traces useful on occasion.
> It's nice to be able to see how amcheck walks the B-Tree in practice.
> Non-behavioral changes:
> * Explains why an ExclusiveLock is needed on the target index by
> bt_index_parent_check() (and a ShareLock on its heap relation).
> This new explanation talks about why an ExclusiveLock *is* necessary
> when checking downlinks against child pages, per Tomas Vondra's
> request. (Note that nothing changes about child/downlink verification
> itself). Before now, I focused on why they were not necessary, but
> Tomas was right to point out that that isn't enough information.
> * No longer uses the word "comport" in an error message, per Amit
> Langote. Similarly, minor typos fixed, per Tomas.
> * Adds a variety of pg_regress-based tests for external sorting. (Via
> various CREATE INDEX statements involving a variety of datatypes. Data
> is generated pseudo-randomly.)
> Notably, the regression tests *never* test external sorting today.
> ISTM that amcheck is a convenient, reliable, and portable way of
> testing the sorting of collatable types like text, so this is the best
> way of testing external sorting. External sorting uses abbreviated
> keys for runs, but discards them as runs are written out, and so the
> merging of runs does not use abbreviated keys. Abbreviated comparisons
> and authoritative comparisons are therefore reliably used within the
> same sort. One particular goal here is that something like the recent
> strxfrm() debacle might be caught by this testing on some platforms
> (if we ever get back to trusting strxfrm() at all, that is). The style
> of these new tests is unorthodox, because a variety of collations are
> tested, which varies from system to system; we should discuss all
> this. The tests can take a minute or two to run on my laptop, but that
> could vary significantly. And so, whether or not that needs to be
> targeted by something other than "check" and "installcheck" is a
> discussion that needs to happen. I imagine a lot of buildfarm animals
> have slow storage, but on the other hand I don't think hackers run the
> contrib regression tests so often. I prefer to not follow the example
> of test_decoding if possible; test_decoding's Makefile only runs tests
> when wal_level=logical is expected (that's how the relevant Makefile
> targets are scoped), but it's a complicated special case.
> Review
> ------
> As already noted after the first bullet point, there are some tricky
> considerations on fixing the race when the user calls
> bt_index_check(). Reviewers should pay close attention to this; could
> I be wrong about that being race-safe even now? This seems to me to be
> by far the best place for reviewers to look for problems in this
> patch, as it's the only bt_index_check()/AccessShareLock case that
> compares anything *across* pages. The measures taken to prevent false
> positives described within bt_target_page_check() require expert
> review, especially because this is the one case where a faulty
> rationale may result in under-protection from false positives, and not
> just harmless over-protection.
> The rationale for why we can reliably recover from a race described
> within the C function bt_right_page_check_scankey() is *complicated*.
> I may have failed to make it as simple as possible despite significant
> effort. It's hard to describe these things in an accessible way. Maybe
> someone can try and understand what I've said there, and let me know
> how I might have fallen short. I have used a new term, "cousin page"
> in this explanation (cf. sibling page). This seems like useful
> terminology that makes these difficult explanations a bit more
> accessible.

I'm not an expert in btree locking races, but I tested the patch it on
different loads and it works as promised.
I didn't find mistakes in the code logic as well.

As a reviewer, I don't have any objections to mark it "Ready for Committer".
The only notice I'd like to add is about it's compatibility with
covering indexes.
We already discussed it in the thread
Tiny attached patch fixes this issue.

Anastasia Lubennikova
Postgres Professional:
The Russian Postgres Company

Attachment Content-Type Size
amcheck_compatible_with_covering_indexes.patch text/x-patch 1.1 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2016-04-08 13:05:34 Re: VS 2015 support in src/tools/msvc
Previous Message Robert Haas 2016-04-08 12:58:57 Re: Optimization for updating foreign tables in Postgres FDW