Re: new heapcheck contrib module

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Geoghegan <pg(at)bowt(dot)ie>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Amul Sul <sulamul(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: new heapcheck contrib module
Date: 2021-02-23 17:38:29
Message-ID: C67CBC40-512A-4CB7-8AF1-8810C6615238@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Feb 17, 2021, at 12:56 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Wed, Feb 17, 2021 at 1:46 PM Mark Dilger
> <mark(dot)dilger(at)enterprisedb(dot)com> wrote:
>> It will reconnect and retry a command one time on error. That should cover the case that the connection to the database was merely lost. If the second attempt also fails, no further retry of the same command is attempted, though commands for remaining relation targets will still be attempted, both for the database that had the error and for other remaining databases in the list.
>>
>> Assuming something is wrong with "db2", the command `pg_amcheck db1 db2 db3` could result in two failures per relation in db2 before finally moving on to db3. That seems pretty awful considering how many relations that could be, but failing to soldier on in the face of errors seems a strange design for a corruption checking tool.
>
> That doesn't seem right at all. I think a PQsendQuery() failure is so
> remote that it's probably justification for giving up on the entire
> operation. If it's caused by a problem with some object, it probably
> means that accessing that object caused the whole database to go down,
> and retrying the object will take the database down again. Retrying
> the object is betting that the user interrupted connectivity between
> pg_amcheck and the database but the interruption is only momentary and
> the user actually wants to complete the operation. That seems unlikely
> to me. I think it's far more probably that the database crashed or got
> shut down and continuing is futile.
>
> My proposal is: if we get an ERROR trying to *run* a query, give up on
> that object but still try the other ones after reconnecting. If we get
> a FATAL or PANIC trying to *run* a query, give up on the entire
> operation. If even sending a query fails, also give up.

This is changed in v40 as you propose to exit on FATAL and PANIC level errors and on error to send a query. On lesser errors (which includes all corruption reports about btrees and some heap corruption related errors), the slot's connection is still useable, I think. Are there cases where the error is lower than FATAL and yet the connection needs to be reestablished? It does not seem so from the testing I have done, but perhaps I'm not thinking of the right sort of non-fatal error?

(I'll wait to post v40 until after hearing your thoughts on this.)

>> In v39, exit(1) is used for all errors which are intended to stop the program. It is important to recognize that finding corruption is not an error in this sense. A query to verify_heapam() can fail if the relation's checksums are bad, and that happens beyond verify_heapam()'s control when the page is not allowed into the buffers. There can be errors if the file backing a relation is missing. There may be other corruption error cases that I have not yet thought about. The connections' errors get reported to the user, but pg_amcheck does not exit as a consequence of them. As discussed above, failing to send the query to the server is not viewed as a reason to exit, either. It would be hard to quantify all the failure modes, but presumably the catalogs for a database could be messed up enough to cause such failures, and I'm not sure that pg_amcheck should just abort.
>
> I agree that exit(1) should happen after any error intended to stop
> the program. But I think it should also happen at the end of the run
> if we hit any problems for which we did not stop, so that exit(0)
> means your database is healthy.

In v40, exit(1) means the program encountered fatal errors leading it to stop, and exit(2) means that a non-fatal error and/or corruption reports occurred somewhere during the processing. Otherwise, exit(0) means your database was successfully checked and is healthy.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-02-23 17:39:09 Re: Some regular-expression performance hacking
Previous Message Andres Freund 2021-02-23 17:34:37 Re: Some regular-expression performance hacking