Re: new heapcheck contrib module

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: new heapcheck contrib module
Date: 2020-05-13 00:34:32
Message-ID: CAH2-Wzkv0uek3wjZYVVKKRX1PF5DchympPmL77FKiXA1xLvn0w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, May 11, 2020 at 10:21 AM Mark Dilger
<mark(dot)dilger(at)enterprisedb(dot)com> wrote:
> 2) amcheck's btree checking functions have been refactored to be able to operate in two modes; the original mode in which all errors are reported via ereport, and a new mode for returning errors as rows from a set returning function.

Somebody suggested that I make amcheck work in this way during its
initial development. I rejected that idea at the time, though. It
seems hard to make it work because the B-Tree index scan is a logical
order index scan. It's quite possible that a corrupt index will have
circular sibling links, and things like that. Making everything an
error removes that concern. There are clearly some failures that we
could just soldier on from, but the distinction gets rather blurred.

I understand why you want to do it this way. It makes sense that the
heap stuff would report all inconsistencies together, at the end. I
don't think that that's really workable (or even desirable) in the
case of B-Tree indexes, though. When an index is corrupt, the solution
is always to do root cause analysis, to make sure that the issue does
not recur, and then to REINDEX. There isn't really a question about
doing data recovery of the index structure.

Would it be possible to log the first B-Tree inconsistency, and then
move on to the next high-level phase of verification? You don't have
to throw an error, but it seems like a good idea for amcheck to still
give up on further verification of the index.

The assertion failure that you reported happens because of a generic
assertion made from _bt_compare(). It doesn't have anything to do with
amcheck (you'll see the same thing from regular index scans), really.
I think that removing that assertion would be the opposite of
hardening. Even if you removed it, the backend will still crash once
you come up with a slightly more evil index tuple. Maybe *that* could
be mostly avoided with widespread hardening; we could in principle
perform cross-checks of varlena headers against the tuple or page
layout at any point reachable from _bt_compare(). That seems like
something that would have unacceptable overhead, because the cost
would be imposed on everything. And even then you've only ameliorated
the problem.

Code like amcheck's PageGetItemIdCareful() goes further than the
equivalent backend macro (PageGetItemId()) to avoid assertion failures
and crashes with corrupt data. I doubt that it is practical to take it
much further than that, though. It's subject to diminishing returns.
In general, _bt_compare() calls user-defined code that is usually
written in C. This C code could in principle feel entitled to do any
number of scary things when you corrupt the input data. The amcheck
module's dependency on user-defined operator code is totally
unavoidable -- it is the single source of truth for the nbtree checks.

It boils down to this: I think that regression tests that run on the
buildfarm and actually corrupt data are not practical, at least in the
case of the index checks -- though probably in all cases. Look at the
pageinspect "btree.out" test output file -- it's very limited, because
we have to work around a bunch of implementation details. It's no
accident that the bt_page_items() test shows a palindrome value in the
data column (the value is "01 00 00 00 00 00 00 01"). That's an
endianness workaround.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Euler Taveira 2020-05-13 00:45:45 Re: Problem with logical replication
Previous Message Alvaro Herrera 2020-05-13 00:16:52 Re: pg13: xlogreader API adjust