Re: Online verification of checksums

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Michael Banck <michael(dot)banck(at)credativ(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Online verification of checksums
Date: 2020-11-05 01:57:16
Message-ID: 20201105015716.GC1632@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Nov 04, 2020 at 05:41:39PM +0100, Michael Banck wrote:
> Am Mittwoch, den 04.11.2020, 17:48 +0900 schrieb Michael Paquier:
>> So, I have done much more testing of this patch using an instance with
>> a small shared buffer pool and pgbench running in parallel for having
>> a large eviction rate, and I cannot convince myself to do that. My
>> laptop got easily constrained on I/O, and within a total of 2000 base
>> backups or so, I have seen some 5 backup failures with a correct
>> detection logic.
>
> I don't quite undestand what you mean here: how do the base backups
> fail,

As of basebackup.c, on HEAD:
if (total_checksum_failures)
{
if (total_checksum_failures > 1)
ereport(WARNING,
(errmsg_plural("%lld total checksum verification failure",
"%lld total checksum verification failures",
total_checksum_failures,
total_checksum_failures)));

ereport(ERROR,
(errcode(ERRCODE_DATA_CORRUPTED),
errmsg("checksum verification failure during base backup")));
}

This means that when at least one page verification fails,
pg_basebackup would fail.

> and what exactly is "correct detection logic"?

I was referring to the patch I sent on this thread that fixes the
detection of a corruption for the zero-only case and where pd_lsn
and/or pg_upper are trashed by a corruption of the page header. Both
cases allow a base backup to complete on HEAD, while sending pages
that could be corrupted, which is wrong. Once you make the page
verification rely only on pd_checksum, as the patch does because the
checksum is the only source of truth in the page header, corrupted
pages are correctly detected, causing pg_basebackup to complain as it
should. However, it has also the risk to cause pg_basebackup to fail
*and* to report as broken pages that are in the process of being
written, depending on how slow a disk is able to finish a 8kB write.
That's a different kind of wrongness, and users have two more reasons
to be pissed. Note that if a page is found as torn we have a
consistent page header, meaning that on HEAD the PageIsNew() and
PageGetLSN() would pass, but the checksum verification would fail as
the contents at the end of the page does not match the checksum.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2020-11-05 02:18:42 Re: Some doubious code in pgstat.c
Previous Message Bharath Rupireddy 2020-11-05 01:52:01 Re: Log message for GSS connection is missing once connection authorization is successful.