Re: [patch] Fix checksum verification in base backups for zero page headers

From: Julien Rouhaud <rjuju123(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] Fix checksum verification in base backups for zero page headers
Date: 2020-10-22 06:27:34
Message-ID: CAOBaU_Z_zpHB3mj2Sj6SxAwxw1GRprFuxLVDcJ+ZZco+F5kvww@mail.gmail.com
Lists: pgsql-hackers

On Thu, Oct 22, 2020 at 9:25 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Thu, Oct 22, 2020 at 12:47:03AM +0300, Anastasia Lubennikova wrote:
> > Thank you. I always forget about this. Do we have any checklist for such
> > changes, that patch authors and reviewers can use?
>
> Not really. That's more a habit than anything else: any non-static
> routine that we publish could be used by some out-of-core code, so
> maintaining pure API compatibility on stable branches is essential.
>
> > We can also read such pages via shared buffers to be 100% sure.
>
> Yeah, but this has its limits as well. One can use
> ignore_checksum_failure, but this can actually be very dangerous, as
> you can end up loading into shared buffers a page whose header looks
> sane but whose contents are largely random corruption, spreading the
> corruption around and leading to more subtle logic failures in
> Postgres, with more panic from customers. Not using
> ignore_checksum_failure is safer, but it makes analyzing the damage
> for a given file harder, as a seqscan would stop at the first failure
> in that file. pg_prewarm can help here, but that's not what the tool
> is meant for either.
>
> We definitely need a different approach that guarantees that a page
> is correctly locked, with no concurrent I/O, while it is checked on
> retry, and I am looking at that for the SQL-level check. That's not
> something I would do in v13 though, but we can make the existing
> logic much more reliable with a fixed number of retries and some
> sleeps in-between. A maximum of 5 retries with 100ms between them
> seems like a good balance from here, though that's not an exact
> science of course, as it depends on the hardware latency.
>
> This has been a sensitive topic during the stability period of v13,
> with many committers commenting on the issue, so I'd rather be very
> careful that there are no objections to what's published here, and I
> am therefore going to ping the other thread on the matter. For now,
> the attached patch addresses the zero-page case and the random LSN
> case.

I'm a bit worried about this approach: if I understand correctly, it
can still lead to false positive reports. I've certainly seen systems
with I/O stalled for more than 500ms, so while that's not frequent, it
could still happen.
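
To make sure we are talking about the same loop, here is roughly how I
read the proposed heuristics. This is only a sketch, not the patch
itself: everything except PAGE_RETRY_THRESHOLD and
PageIsVerifiedExtended() is a made-up name, and I am ignoring how the
blocks are actually read in basebackup.c.

#include "postgres.h"
#include "storage/bufpage.h"

#define PAGE_RETRY_THRESHOLD    5               /* from the patch */
#define PAGE_RETRY_SLEEP_US     (100 * 1000L)   /* 100ms between attempts */

/* Hypothetical helper: re-read and re-verify one block of a relation file. */
static bool
verify_block_with_retries(int fd, BlockNumber blkno, char *buf)
{
    int         attempt;

    for (attempt = 0; attempt < PAGE_RETRY_THRESHOLD; attempt++)
    {
        if (attempt > 0)
            pg_usleep(PAGE_RETRY_SLEEP_US);

        /* A short read counts as a failure and is simply retried. */
        if (pg_pread(fd, buf, BLCKSZ, (off_t) blkno * BLCKSZ) != BLCKSZ)
            continue;

        /*
         * No lock is taken here, which is exactly why an I/O stall
         * longer than all of the retries combined can still produce a
         * false positive report.
         */
        if (PageIsVerifiedExtended((Page) buf, blkno, 0))
            return true;
    }

    return false;
}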

About the patch:

+ * doing this check, causing a false positive. If that
+ * happens, a page is retried once, with an error reported if
+ * the second attempt also fails.

[...]

+ /* The verification of a page has failed, retry once */
+ if (block_attempts < PAGE_RETRY_THRESHOLD)
+ {

Both of those comments still seem to refer to the previous "retry only
once" approach. Apart from that and my concerns about the new
heuristics, the patch looks good. +1 for PageIsVerifiedExtended()!
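
For the archives, here is how I would expect a caller to use the new
entry point compared to plain PageIsVerified(). The flag names come
from my reading of the patch, so they may differ from whatever ends up
committed, and check_one_block() is just an illustrative caller.

#include "postgres.h"
#include "storage/bufpage.h"

static void
check_one_block(Page page, BlockNumber blkno)
{
    /*
     * Existing entry point: on failure it emits its own WARNING and
     * reports the checksum failure to the stats collector.
     */
    (void) PageIsVerified(page, blkno);

    /*
     * New entry point: the caller decides whether to emit the WARNING
     * and whether to bump the checksum failure counters, so a retry
     * loop can stay silent until its last attempt.
     */
    if (!PageIsVerifiedExtended(page, blkno,
                                PIV_LOG_WARNING | PIV_REPORT_STAT))
        elog(WARNING, "block %u failed verification", blkno);
}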
