Re: new heapcheck contrib module

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Amul Sul <sulamul(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: new heapcheck contrib module
Date: 2020-08-28 13:58:43
Message-ID: CA+TgmobgkU0U-XDcpRZvEWb-btOsiZXU3_Uf-15rge+JvqcJsg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Aug 28, 2020 at 1:07 AM Andrey M. Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
> I was experimenting a bit with our internal heapcheck and found out that it's not helping with truncated CLOG anyhow.
> Will your module be able to gather tid's of similar corruptions?
>
> server/db M # select * from heap_check('pg_toast.pg_toast_4848601');
> ERROR: 58P01: could not access status of transaction 636558742
> DETAIL: Could not open file "pg_xact/025F": No such file or directory.
> LOCATION: SlruReportIOError, slru.c:913
> Time: 3439.915 ms (00:03.440)

This kind of thing gets really tricky. PostgreSQL uses errors in tons
of places to report problems, and if you want to accumulate a list of
errors and report them all rather than just letting the first one
cancel the operation, you need special handling for each individual
error you want to bypass. A tool like this naturally wants to use as
much PostgreSQL infrastructure as possible, to avoid duplicating a ton
of code and creating a bloated monstrosity, but all that code can
throw errors. I think the code in its current form is trying to be
resilient against problems on the table pages that it is actually
checking, but it can't necessarily handle gracefully corruption in
other parts of the system. For instance:

- CLOG could be truncated, as in your example
- the disk files could have had their permissions changed so that they
can't be accessed
- the PageIsVerified() check might fail when pages are read
- the TOAST table's metadata in pg_class/pg_attribute/etc. could be corrupted
- ...or the files for those system catalogs could've had their
permissions changed
- ....or they could contain invalid pages
- ...or their indexes could be messed up

I think there are probably a bunch more, and I don't think it's
practical to allow this tool to continue after arbitrary stuff goes
wrong. It'll be too much code and impossible to maintain. In the case
you mention, I think we should view that as a problem with clog rather
than a problem with the table, and thus out of scope.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Gilles Darold 2020-08-28 13:58:58 Re: New default role- 'pg_read_all_data'
Previous Message Ranier Vilela 2020-08-28 13:37:08 Re: Clang UndefinedBehaviorSanitize (Postgres14) Detected undefined-behavior