Re: Online verification of checksums

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Michael Banck <michael(dot)banck(at)credativ(dot)de>, Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Online verification of checksums
Date: 2020-10-20 09:11:03
Message-ID: 20201020091103.GA1475@paquier.xyz
Lists: pgsql-hackers

On Mon, Apr 06, 2020 at 04:45:44PM -0400, Tom Lane wrote:
> Actually, after thinking about that a bit more: why is there an LSN-based
> special condition at all? It seems like it'd be far more useful to
> checksum everything, and on failure try to re-read and re-verify the page
> once or twice, so as to handle the corner case where we examine a page
> that's in process of being overwritten.

I was reviewing this area today, and that actually matches my
impression. Why do we need an LSN-based check at all? As said
upthread, that check is of course weak against random data, as we would
miss most of the real checksum failures, with the odds only improving as
the current LSN of the cluster moves on. However, it seems to me
that removing this check altogether has an extra advantage: it would
be possible to verify pages even if these
are more recent than the start LSN of the backup, and that could be a
lot of pages that could be checked on a large cluster. So by keeping
this check we also delay the detection of real problems. As things
stand, I'd like to think that it would be much more useful to remove
this check and to have one or two extra retries (the current code only
has one). I don't much like the possibility of false positives for
such critical checks, but as we need to live with what has been
released, that looks like a good move for stable branches.
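
To make the retry idea concrete, here is a rough sketch in C of
"checksum everything, and re-read on failure". This is not the
basebackup.c code: toy_checksum() is a simple additive stand-in rather
than pg_checksum_page(), the page layout (checksum stored in the first
two bytes) is invented, and page_reader, flaky_source and
verify_page_with_retries() are hypothetical names used only for
illustration. The simulated reader returns a torn page a configurable
number of times before returning a consistent one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192

typedef enum { PAGE_OK, PAGE_CHECKSUM_FAILURE, PAGE_READ_ERROR } verify_result;

/* Hypothetical callback: fetch the current contents of block blkno. */
typedef bool (*page_reader)(void *ctx, uint32_t blkno, uint8_t *buf);

/* Toy additive checksum over everything after the 2-byte checksum slot. */
static uint16_t
toy_checksum(const uint8_t *page, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 2; i < len; i++)
        sum += page[i];
    return (uint16_t) (sum % 65521);
}

/* Assumed layout: checksum stored little-endian in bytes 0 and 1. */
static uint16_t
stored_checksum(const uint8_t *page)
{
    return (uint16_t) (page[0] | (page[1] << 8));
}

/*
 * Verify one page with no LSN-based skip.  On a mismatch, re-read the
 * page and re-verify, up to max_retries extra times, so that a page
 * caught in the middle of being overwritten gets a chance to settle.
 */
static verify_result
verify_page_with_retries(page_reader read_page, void *ctx,
                         uint32_t blkno, int max_retries)
{
    uint8_t buf[PAGE_SIZE];

    assert(max_retries >= 0);
    for (int attempt = 0; attempt <= max_retries; attempt++)
    {
        if (!read_page(ctx, blkno, buf))
            return PAGE_READ_ERROR;
        if (toy_checksum(buf, PAGE_SIZE) == stored_checksum(buf))
            return PAGE_OK;
    }
    return PAGE_CHECKSUM_FAILURE;
}

/* Simulated source: the first torn_reads_left reads return a torn page. */
struct flaky_source
{
    int torn_reads_left;
    int reads;
};

static bool
flaky_reader(void *ctx, uint32_t blkno, uint8_t *buf)
{
    struct flaky_source *src = ctx;
    uint16_t sum;

    (void) blkno;               /* single-page simulation */
    src->reads++;

    /* Build a consistent page and stamp its checksum. */
    memset(buf, 0xAB, PAGE_SIZE);
    sum = toy_checksum(buf, PAGE_SIZE);
    buf[0] = (uint8_t) (sum & 0xFF);
    buf[1] = (uint8_t) (sum >> 8);

    /* Torn read: the second half comes from a different page version. */
    if (src->torn_reads_left > 0)
    {
        src->torn_reads_left--;
        memset(buf + PAGE_SIZE / 2, 0xCD, PAGE_SIZE / 2);
    }
    return true;
}
```

With max_retries = 2, a page that is torn on its first two reads is
still reported as PAGE_OK, while one that keeps failing is reported as
PAGE_CHECKSUM_FAILURE after three reads. How many retries are enough,
and whether to wait between them, is exactly the open question here.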
--
Michael
