Re: [patch] Fix checksum verification in base backups for zero page headers

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Julien Rouhaud <rjuju123(at)gmail(dot)com>
Cc: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] Fix checksum verification in base backups for zero page headers
Date: 2020-10-22 06:52:01
Message-ID: 20201022065201.GJ1475@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Oct 22, 2020 at 02:27:34PM +0800, Julien Rouhaud wrote:
> I'm a bit worried about this approach, as if I understand correctly
> this can lead to false positive reports. I've certainly seen systems
> with IO stalled for more than 500ms, so while this is not frequent
> this could still happen.

The possibility of false positives is not a new thing with this
feature as currently shaped. On HEAD, this code actually just does
one retry, without waiting at all for the operation potentially
happening in parallel to finish, so that's actually worse. And that's
assuming that the pd_lsn of the page did not get touched by a
corruption as we would simply miss a broken page. So, with a
non-locking approach, we limit ourselves to tweaking the number of
retries and some sleeps :(

I am not sure that increasing the sleep of 100ms is a good thing on
not-so-slow disks, but we could increase the number of retries. The
patch makes that easier to change at least. FWIW, I don't like that
this code, with a real risk of false positives, got merged to begin
with, and I think that other people share the same opinion, but it is
not like we can just remove it on a branch already released either..
And I am not sure if we have done such things in the past for stable
branches. If we were to do that, we could just make the operation a
no-op, and keep some traces of the grammar for compatibility.

> About the patch:
>
> + * doing this check, causing a false positive. If that
> + * happens, a page is retried once, with an error reported if
> + * the second attempt also fails.
>
> [...]
>
> + /* The verification of a page has failed, retry once */
> + if (block_attempts < PAGE_RETRY_THRESHOLD)
> + {

Oops. Thanks, I got that changed on my branch.
--
Michael

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2020-10-22 07:03:49 Re: abstract Unix-domain sockets
Previous Message Kyotaro Horiguchi 2020-10-22 06:48:46 Re: [Patch] Optimize dropping of relation buffers using dlist