Re: pg_amcheck contrib application

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Amul Sul <sulamul(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_amcheck contrib application
Date: 2021-03-04 22:04:37
Message-ID: CAH2-WznsRybLrkJY2E++oXmt531p45jpMiigwDSc4y2A6f9C4g@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 4, 2021 at 7:29 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I think this whole approach is pretty suspect because the number of
> blocks in the relation can increase (by relation extension) or
> decrease (by VACUUM or TRUNCATE) between the time when we query for
> the list of target relations and the time we get around to executing
> any queries against them. I think it's OK to use the number of
> relation pages for progress reporting because progress reporting is
> only approximate anyway, but I wouldn't print them out in the progress
> messages, and I wouldn't try to fix up the startblock and endblock
> arguments on the basis of how long you think that relation is going to
> be.

I don't think that the struct AmcheckOptions block fields (e.g.,
startblock) should be of type 'long' -- that doesn't work well on
Windows, where 'long' is only 32 bits wide, even in 64-bit builds. To
be fair, we already do the same thing elsewhere, but there is no
reason to repeat those mistakes. (I'm rather suspicious of 'long' in
general.)

I think that you could use BlockNumber + strtoul() without breaking Windows.
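
To make the BlockNumber + strtoul() idea concrete, here is a rough sketch of what the option parsing might look like. It is not taken from any patch: parse_block_number() is a hypothetical helper name, and the typedef and macros are stand-ins that are meant to mirror what storage/block.h defines, so that the snippet compiles on its own.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-ins (assumed to mirror storage/block.h) so this compiles alone */
typedef uint32_t BlockNumber;
#define InvalidBlockNumber	((BlockNumber) 0xFFFFFFFF)
#define MaxBlockNumber		((BlockNumber) 0xFFFFFFFE)

/*
 * Hypothetical helper: parse a decimal block number from a command-line
 * argument.  Returns true and sets *result on success; returns false on
 * malformed or out-of-range input.
 */
static bool
parse_block_number(const char *str, BlockNumber *result)
{
	char	   *end;
	unsigned long val;

	assert(result != NULL);

	/*
	 * strtoul() skips leading whitespace and accepts a sign, silently
	 * wrapping negative input.  Insist that the first byte is a digit.
	 */
	if (str == NULL || *str < '0' || *str > '9')
		return false;

	errno = 0;
	val = strtoul(str, &end, 10);

	/* Overflow of unsigned long, or trailing garbage after the digits */
	if (errno != 0 || *end != '\0')
		return false;

	/*
	 * unsigned long is 32 bits on Windows and 64 bits on most other
	 * platforms; this check gives the same result on both.  It also
	 * rejects InvalidBlockNumber.
	 */
	if (val > MaxBlockNumber)
		return false;

	*result = (BlockNumber) val;
	return true;
}
```

The point is only that an unsigned 32-bit type covers the whole block number range on every platform, whereas a signed 32-bit 'long' cannot represent the upper half of it.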

> There are a LOT of things that can go wrong when we go try to run
> verify_heapam on a table. The table might have been dropped; in fact,
> on a busy production system, such cases are likely to occur routinely
> if DDL is common, which for many users it is. The system catalog
> entries might be screwed up, so that the relation can't be opened.
> There might be an unreadable page in the relation, either because the
> OS reports an I/O error or something like that, or because checksum
> verification fails. There are various other possibilities. We
> shouldn't view such errors as low-level things that occur only in
> fringe cases; this is a corruption-checking tool, and we should expect
> that running it against messed-up databases will be common. We
> shouldn't try to interpret the errors we get or make any big decisions
> about them, but we should have a clear way of reporting them so that
> the user can decide what to do.

I agree.

Your database is not supposed to be corrupt. Once your database has
become corrupt, all bets are off -- something happened that was
supposed to be impossible -- which seems like a good reason to be
modest about what we think we know.

The user should always see the unvarnished truth. pg_amcheck should
not presume to suppress errors from lower-level code, except perhaps
in well-scoped special cases.

--
Peter Geoghegan
