Re: Online verification of checksums

From: David Steele <david(at)pgmasters(dot)net>
To: Stephen Frost <sfrost(at)snowman(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online verification of checksums
Date: 2018-09-18 17:52:03
Message-ID: 47e26e3d-989f-b034-f2fc-926b67cc22bf@pgmasters.net
Lists: pgsql-hackers

On 9/18/18 11:45 AM, Stephen Frost wrote:
> * Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:

>> I have added a retry for this as well now, without a pg_sleep() as well.
>
>> This catches around 80% of the half-reads, but a few slip through. At
>> that point we bail out with exit(1), and the user can try again, which I
>> think is fine? 
>
> No, this is perfectly normal behavior, as is having completely blank
> pages, now that I think about it. If we get a short read then I'd say
> we simply check that we got an EOF and, in that case, we just move on.
>
>> Alternatively, we could just skip to the next file then and not make
>> it count as a checksum failure.
>
> No, I wouldn't count it as a checksum failure. We could possibly count
> it towards the skipped pages, though I'm even on the fence about that.

+1 for it not being a failure. Personally I'd count it as a skipped
page, since we know the page exists but it can't be verified.

The other option is to wait for the page to stabilize, which doesn't
seem like it would take very long in most cases -- unless you are doing
this test from another host with shared storage. Then I would expect to
see all kinds of interesting torn pages after the last checkpoint.
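
To make the "skip, don't fail" idea concrete, here is a rough sketch of
how a scan loop might treat short reads. It is not the code from the
patch: scan_file(), ScanStats and MAX_RETRIES are names invented for
illustration, and the checksum check itself is left out. A partial
block is re-read a few times; a zero-byte read is taken as a clean EOF;
a block that stays partial is counted as skipped rather than failed.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192      /* PostgreSQL's default block size */
#define MAX_RETRIES 3    /* hypothetical retry budget, not from the patch */

typedef struct ScanStats
{
    long blocks_read;    /* full blocks that could be checksum-verified */
    long blocks_skipped; /* blocks that stayed partial after all retries */
} ScanStats;

/* Returns 0 on success (EOF reached), -1 on an open or read error. */
static int
scan_file(const char *path, ScanStats *stats)
{
    char  buf[BLCKSZ];
    off_t offset = 0;
    int   fd;

    assert(path != NULL && stats != NULL);
    stats->blocks_read = 0;
    stats->blocks_skipped = 0;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    for (;;)
    {
        ssize_t n = 0;
        int     attempt;

        /* Re-read the same block while the read keeps coming back short. */
        for (attempt = 0; attempt <= MAX_RETRIES; attempt++)
        {
            n = pread(fd, buf, BLCKSZ, offset);
            if (n < 0 || n == 0 || n == BLCKSZ)
                break;
        }

        if (n < 0)
        {
            close(fd);
            return -1;              /* real I/O error */
        }
        if (n == 0)
            break;                  /* clean EOF: just move on */

        if (n == BLCKSZ)
            stats->blocks_read++;   /* checksum verification would go here */
        else
            stats->blocks_skipped++; /* page exists but can't be verified */

        offset += BLCKSZ;
    }

    close(fd);
    return 0;
}
```

A caller would report blocks_skipped separately from checksum failures
and exit non-zero only for the latter. Waiting for the page to
stabilize would amount to a short sleep between the retries, which the
sketch omits.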

Regards,
--
-David
david(at)pgmasters(dot)net
