Re: new heapcheck contrib module

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Geoghegan <pg(at)bowt(dot)ie>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Amul Sul <sulamul(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: new heapcheck contrib module
Date: 2021-02-17 20:56:24
Message-ID: CA+Tgmob2c0eM8+5kzkXaqdc9XbBCkHmtihSOSk-jCzzT4BFhFQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 17, 2021 at 1:46 PM Mark Dilger
<mark(dot)dilger(at)enterprisedb(dot)com> wrote:
> It will reconnect and retry a command one time on error. That should cover the case that the connection to the database was merely lost. If the second attempt also fails, no further retry of the same command is attempted, though commands for remaining relation targets will still be attempted, both for the database that had the error and for other remaining databases in the list.
>
> Assuming something is wrong with "db2", the command `pg_amcheck db1 db2 db3` could result in two failures per relation in db2 before finally moving on to db3. That seems pretty awful considering how many relations that could be, but failing to soldier on in the face of errors seems a strange design for a corruption checking tool.

That doesn't seem right at all. I think a PQsendQuery() failure is so
remote that it's probably justification for giving up on the entire
operation. If it's caused by a problem with some object, it probably
means that accessing that object caused the whole database to go down,
and retrying the object will take the database down again. Retrying
the object is betting that connectivity between pg_amcheck and the
database was interrupted only momentarily and that the user actually
wants to complete the operation. That seems unlikely to me. I think
it's far more probable that the database crashed or got shut down and
continuing is futile.

My proposal is: if we get an ERROR trying to *run* a query, give up on
that object but still try the other ones after reconnecting. If we get
a FATAL or PANIC trying to *run* a query, give up on the entire
operation. If even sending a query fails, also give up.

> In v39, exit(1) is used for all errors which are intended to stop the program. It is important to recognize that finding corruption is not an error in this sense. A query to verify_heapam() can fail if the relation's checksums are bad, and that happens beyond verify_heapam()'s control when the page is not allowed into the buffers. There can be errors if the file backing a relation is missing. There may be other corruption error cases that I have not yet thought about. The connections' errors get reported to the user, but pg_amcheck does not exit as a consequence of them. As discussed above, failing to send the query to the server is not viewed as a reason to exit, either. It would be hard to quantify all the failure modes, but presumably the catalogs for a database could be messed up enough to cause such failures, and I'm not sure that pg_amcheck should just abort.

I agree that exit(1) should happen after any error intended to stop
the program. But I think it should also happen at the end of the run
if we hit any problems for which we did not stop, so that exit(0)
means your database is healthy.
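
Something like this minimal sketch (with made-up variable names, not
pg_amcheck's actual ones) is what I mean by the end-of-run exit status:

#include <stdbool.h>
#include <stdlib.h>

static bool all_ok = true;      /* cleared whenever corruption is reported or
                                 * a per-object error is hit without aborting */

int
main(void)
{
    /* ... run the checks, clearing all_ok on any problem we report ... */

    /* exit(0) only when no problem of any kind was seen, so a zero exit
     * status really does mean the database looked healthy. */
    exit(all_ok ? 0 : 1);
}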

--
Robert Haas
EDB: http://www.enterprisedb.com
