Re: new heapcheck contrib module

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Amul Sul <sulamul(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: new heapcheck contrib module
Date: 2020-08-05 20:36:40
Message-ID: CAH2-WznUqDJ+WjVD6R_BdCEa2FpV5fK+4No+HFzNLqfOf-+3Sw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Aug 5, 2020 at 7:09 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Not sure I agree with this. As a homeowner, the distinction between 0
> and 1 is less significant to me than the distinction between a few
> (preferably in places where I'll never see them) and whole lot. I
> agree with you to an extent though: all I really care about is whether
> I have too few to worry about, enough that I'd better try to take care
> of it somehow, or so many that I need a professional exterminator. If,
> however, I were a professional exterminator, I would be unhappy with
> just knowing that there are few problems or many. I suspect I would
> want to know something about where the problems were, and get a more
> nuanced indication of just how bad things are in each location.

Right, but the professional exterminator can be expected to use expert
level tools, where a great deal of technical sophistication is
required to interpret what's going on sensibly. An amatuer can only
use them to determine if something is wrong at all, which is usually
not how they add value.

(I think that my analogy is slightly flawed in that it hinged upon
everybody hating cockroaches as much as I do, which is more than the
ordinary amount.)

> FWIW, pg_catcheck is an example of an existing tool (designed by me
> and written partially by me) that uses the kind of model I'm talking
> about. It does a single SELECT * FROM pg_<whatever> on each catalog
> table - so that it doesn't get confused if your system catalog indexes
> are messed up - and then performs a bunch of cross-checks on the
> tuples it gets back and tells you about all the messed up stuff. If it
> can't get data from all your catalog tables it performs whichever
> checks are valid given what data it was able to get. As a professional
> exterminator of catalog corruption, I find it quite helpful.

I myself seem to have had quite different experiences with corruption,
presumably because it happened at product companies like Heroku. I
tended to find software bugs (e.g. the one fixed by commit 008c4135)
that were rare and novel by casting a wide net over a large number of
relatively homogenous databases. Whereas your experiences tend to
involve large support customers with more opportunity for operator
error. Both perspectives are important.

> The second group is a lot harder. It is in general difficult to
> speculate about how something that is now wrong got that way given
> knowledge only of the present state of affairs. But good tooling makes
> it easier to speculate intelligently. To take a classic example,
> there's a great difference between a checksum failure caused by the
> checksum being incorrect on an otherwise-valid page; a checksum
> failure on a page the first half of which appears valid and the second
> half of which looks like it might be some other database page; and a
> checksum failure on a page whose contents appear to be taken from a
> Microsoft Word document. I'm not saying we ever want a tool which
> tries to figure that sort of thing out in an automated way; there's no
> substitute for human intelligence (yet, anyway).

I wrote my own expert level tool, pg_hexedit. I have to admit that the
level of interest in that tool doesn't seem to be all that great,
though I myself have used it to investigate corruption to great
effect. But I suppose there is no way to know how it's being used.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2020-08-05 20:38:38 Re: Support for NSS as a libpq TLS backend
Previous Message Robert Haas 2020-08-05 20:20:48 Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)