Re: Online checksums verification in the backend

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Julien Rouhaud <rjuju123(at)gmail(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Re: Online checksums verification in the backend
Date: 2020-10-19 07:52:48
Message-ID: 20201019075248.GD9612@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Oct 19, 2020 at 11:16:38AM +0800, Julien Rouhaud wrote:
> On Mon, Oct 19, 2020 at 10:39 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>> Somewhat related to the first point, NoComputedChecksum exists
>> because, as the current patch is shaped, we need to report an existing
>> checksum to the user even for the zero-only case.
>
> I'm not sure that I understand your point. The current patch only
> returns something to users when there's a corruption. If by
> "zero-only case" you mean "page corrupted in a way that PageIsNew()
> returns true while not being all zero", then it's a corrupted page and
> then obviously yes it needs to be returned to users.

Sorry for the confusion, this previous paragraph was confusing. I
meant that the reason why NoComputedChecksum exists is that we give up
on attempting to calculate the checksum if we detect that the page is
new, but failed the zero-only test, and that we want the users to know
about this special case by setting this expected checksum to NULL for
the SRF.

>> So, wouldn't it be better to just rely on PageIsVerified()
>> (well it's rather-soon-to-be extended flavor) for the checksum check,
>> the header sanity check and the zero-only check? My point is to keep
>> a single entry point for all the page sanity checks, so as base
>> backups, your patch, and the buffer manager apply the same things.
>> Base backups got that partially wrong because the base backup code
>> wants to keep control of the number of failures and the error
>> reports.
>
> I'm fine with it.

Thanks.

>> Your patch actually wishes to report a failure, but you want
>> to add more context with the fork name and such. Another option we
>> could use here is to add an error context so as PageIsVerified()
>> reports the WARNING, but the SQL function provides more context with
>> the block number and the relation involved in the check.
>
> Also, returning actual data rather than a bunch of warnings is way
> easier to process for client code. And as mentioned previously having
> an API that returns a list of corrupted blocks could be useful for a
> single-page recovery feature.

No issues with reporting the block number and the fork type in the SRF
at least of course as this is helpful to detect the position of the
broken blocks. For the checksum found in the header and the one
calculated (named expected in the patch), I am less sure how to put a
clear definition on top of that but we could always consider that
later and extend the SRF as needed. Once the user knows that both do
not match, he/she knows that there is a problem.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dmitry Shulga 2020-10-19 07:58:15 Re: Reduce the time required for a database recovery from archive.
Previous Message Masahiko Sawada 2020-10-19 07:50:13 Re: Resetting spilled txn statistics in pg_stat_replication