Re: new heapcheck contrib module

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: new heapcheck contrib module
Date: 2020-05-14 00:18:40
Message-ID: 4D704A12-2546-4789-90B0-B0445BFC717B@enterprisedb.com
Lists: pgsql-hackers

> On May 13, 2020, at 3:29 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Wed, May 13, 2020 at 3:10 PM Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
>> Hmm. I think we should (try to?) write code that avoids all crashes
>> with production builds, but not extend that to assertion failures.
>
> Assertions are only a problem at all because Mark would like to write
> tests that involve a selection of truly corrupt data. That's a new
> requirement, and one that I have my doubts about.
>
>>> I'll stick with your example. You're calling
>>> TransactionIdDidCommit() from check_tuphdr_xids(), which will
>>> interrogate the commit log and pg_subtrans. It's just not under your
>>> control.
>>
>> in a production build this would just fail with an error that the
>> pg_xact file cannot be found, which is fine -- if this happens in a
>> production system, you're not disturbing any other sessions. Or maybe
>> the file is there and the byte can be read, in which case you would get
>> the correct response; but that's fine too.
>
> I think that this is fine, too, since I don't consider assertion
> failures with corrupt data all that important. I'd make some effort to
> avoid it, but not too much, and not at the expense of a useful general
> purpose assertion that could catch bugs in many different contexts.

I am not removing any assertions. I do not propose to remove any assertions. When I talk about "hardening against assertions", that is not in any way a proposal to remove assertions from the code. What I'm talking about is writing the amcheck contrib module code in such a way that it only calls a function that could assert on bad data after checking that the data is not bad.
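To illustrate the pattern I mean, here is a sketch in Python rather than C. The function names, the dictionary standing in for the clog, and the reporting mechanism are all stand-ins, not the actual backend API; the only grounded detail is that PostgreSQL reserves transaction ids below 3 (invalid, bootstrap, frozen), which is what TransactionIdIsNormal tests.

```python
# Illustrative sketch of "hardening against assertions": screen an
# on-disk value before handing it to a routine that asserts on bad
# input. Everything here is a stand-in for the backend code, except
# that xids 0, 1, and 2 really are reserved in PostgreSQL.
FIRST_NORMAL_XID = 3

def xid_did_commit(xid, clog):
    # stand-in for TransactionIdDidCommit(): like the backend routine,
    # it is entitled to assume it is handed a normal xid
    assert xid >= FIRST_NORMAL_XID, "caller must pass a normal xid"
    return clog.get(xid, False)

def checked_xid_did_commit(xid, clog, reports):
    # the hardened caller: validate first, and report corruption
    # instead of tripping the callee's assertion
    if xid < FIRST_NORMAL_XID:
        reports.append(f"tuple header contains abnormal xid {xid}")
        return None
    return xid_did_commit(xid, clog)
```

The point is only that the corruption check happens in the caller, so the assertion in the callee stays exactly where it is.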

I don't know that hardening against assertions in this manner is worth doing, but that is nonetheless what I'm talking about. You have made decent arguments that it probably isn't worth doing for the btree checking code. In any event, it is probably something that could be addressed in a future patch, after getting this patch committed.

There is a separate but related question in the offing about whether the backend code, independently of any amcheck contrib stuff, should be more paranoid in how it processes tuples to check for corruption. The heap deform tuple code in question is on a pretty hot code path, and I don't know that folks would accept the performance hit of more checks being done in that part of the system, but that's pretty far from relevant to this patch. That should be hashed out, or not, at some other time on some other thread.

> I would be willing to make a larger effort to avoid crashing a
> backend, since that affects production. I might go to some effort to
> not crash with downright adversarial inputs, for example. But it seems
> inappropriate to take extreme measures just to avoid a crash with
> extremely contrived inputs that will probably never occur.

I think this is a misrepresentation of the tests that I've been running. There are two kinds of tests that I have done:

First, there is the regression test, t/004_verify_heapam.pl, which is obviously contrived. It was included in the regression test suite because it needed to be something other developers could read, verify ("yeah, I can see why that would be corruption, and why it would produce an error message of the sort the test expects"), and then run to confirm that the expected error message was indeed generated.

The second kind of corruption test I have been running is nothing more than writing random nonsense into randomly chosen locations within heap files and then running verify_heapam against those relations. It's much more Murphy than Machiavelli when the junk is just generated by calling random(). When I initially did this kind of testing, the heapam checking code had lots of problems; now it doesn't. There is very little contrived about that, as far as I can see. It's the kind of corruption you'd expect from any number of faulty storage systems. The one "contrived" aspect of my testing in this regard is that the script I use to write random nonsense to random locations in heap files is smart enough not to write junk into the page headers. That is because if I corrupt a page header, the backend never even gets as far as running the verify_heapam functions; the page is rejected by the header sanity checks when it is read in.
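For what it's worth, the script boils down to something like the following sketch (Python here for brevity; this is not the actual script. It assumes the stock build defaults of an 8192-byte BLCKSZ and a 24-byte page header, which would be wrong for a nonstandard --with-blocksize build):

```python
import os
import random

BLCKSZ = 8192      # stock PostgreSQL block size (assumption: default build)
PAGE_HEADER = 24   # sizeof(PageHeaderData) on common platforms

def corrupt_heap_file(path, n_writes=100, seed=None):
    """Overwrite n_writes randomly chosen bytes in the file with random
    nonzero values, skipping each page's header so the page still passes
    the header sanity checks when the backend reads it in."""
    rng = random.Random(seed)
    n_pages = os.path.getsize(path) // BLCKSZ
    if n_pages == 0:
        return
    with open(path, "r+b") as f:
        for _ in range(n_writes):
            page = rng.randrange(n_pages)
            # pick an offset within the page, past the header
            off = rng.randrange(PAGE_HEADER, BLCKSZ)
            f.seek(page * BLCKSZ + off)
            # a nonzero byte, so every write perturbs something
            f.write(bytes([rng.randrange(1, 256)]))
```

After running something like this against a relation's underlying file, verify_heapam should report the damage rather than crash the backend.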


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
