Re: Online verification of checksums

From: David Steele <david(at)pgmasters(dot)net>
To: Michael Paquier <michael(at)paquier(dot)xyz>, Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Stephen Frost <sfrost(at)snowman(dot)net>
Subject: Re: Online verification of checksums
Date: 2020-11-20 14:50:33
Message-ID: 6eb9697d-5eca-8b9a-6f1a-44e9f13054b7@pgmasters.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Michael,

On 11/20/20 2:28 AM, Michael Paquier wrote:
> On Mon, Nov 16, 2020 at 11:41:51AM +0100, Magnus Hagander wrote:
>> I was referring to the latest patch on the thread. But as I said, I have
>> not read up on all the different issues raised in the thread, so take it
>> with a big grain os salt.
>>
>> And I would also echo the previous comment that this code was adapted from
>> what the pgbackrest folks do. As such, it would be good to get a comment
>> from for example David on that -- I don't see any of them having commented
>> after that was mentioned?
>
> Agreed. I am adding Stephen as well in CC. From the code of
> backrest, the same logic happens in src/command/backup/pageChecksum.c
> (see pageChecksumProcess), where two checks on pd_upper and pd_lsn
> happen before verifying the checksum. So, if the page header finishes
> with random junk because of some kind of corruption, even corrupted
> pages would be incorrectly considered as correct if the random data
> passes the pd_upper and pg_lsn checks :/

Indeed, this is not good, as Andres pointed out some time ago. My
apologies for not getting to this sooner.

Our current plan for pgBackRest:

1) Remove the LSN check as you have done in your patch and when
rechecking see if the page has become valid *or* the LSN is ascending.
2) Check the LSN against the max LSN reported by PostgreSQL to make sure
it is valid.

These do completely rule out any type of corruption, but they certainly
narrows the possibility by a lot.

In the future we would also like to scan the WAL to verify that the page
is definitely being written to.

As for your patch, it mostly looks good but my objection is that a page
may be reported as invalid after 5 retries when in fact it may just be
very hot.

Maybe checking for an ascending LSN is a good idea there as well? At
least in that case we could issue a different warning, instead of
"checksum verification failed" perhaps "checksum verification skipped
due to concurrent modifications".

Regards,
--
-David
david(at)pgmasters(dot)net

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2020-11-20 14:54:24 Re: parsing pg_ident.conf
Previous Message Simon Riggs 2020-11-20 14:47:33 Re: VACUUM (DISABLE_PAGE_SKIPPING on)