Re: Online verification of checksums

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Michael Banck <michael(dot)banck(at)credativ(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online verification of checksums
Date: 2018-09-18 15:45:36
Message-ID: 20180918154536.GE4184@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Greetings,

* Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
> please find attached version 2 of the patch.
>
> Am Donnerstag, den 26.07.2018, 13:59 +0200 schrieb Michael Banck:
> > I've now forward-ported this change to pg_verify_checksums, in order to
> > make this application useful for online clusters, see attached patch.
> >
> > I've tested this in a tight loop (while true; do pg_verify_checksums -D
> > data1 -d > /dev/null || /bin/true; done)[2] while doing "while true; do
> > createdb pgbench; pgbench -i -s 10 pgbench > /dev/null; dropdb pgbench;
> > done", which I already used to develop the original code in the fork and
> > which brought up a few bugs.
> >
> > I got one checksums verification failure this way, all others were
> > caught by the recheck (I've introduced a 500ms delay for the first ten
> > failures) like this:
> >
> > > pg_verify_checksums: checksum verification failed on first attempt in
> > > file "data1/base/16837/16850", block 7770: calculated checksum 785 but
> > > expected 5063
> > > pg_verify_checksums: block 7770 in file "data1/base/16837/16850"
> > > verified ok on recheck
>
> I have now changed this from the pg_sleep() to a check against the
> checkpoint LSN as discussed upthread.

Ok.

> > However, I am also seeing sporadic (maybe 0.5 times per pgbench run)
> > failures like this:
> >
> > > pg_verify_checksums: short read of block 2644 in file
> > > "data1/base/16637/16650", got only 4096 bytes
> >
> > This is not strictly a verification failure, should we do anything about
> > this? In my fork, I am also rechecking on this[3] (and I am happy to
> > extend the patch that way), but that makes the code and the patch more
> > complicated and I wanted to check the general opinion on this case
> > first.
>
> I have added a retry for this as well now, without a pg_sleep() as well.

> This catches around 80% of the half-reads, but a few slip through. At
> that point we bail out with exit(1), and the user can try again, which I
> think is fine? 

No, this is perfectly normal behavior, as is having completely blank
pages, now that I think about it. If we get a short read then I'd say
we simply check that we got an EOF and, in that case, we just move on.

> Alternatively, we could just skip to the next file then and don't make
> it count as a checksum failure.

No, I wouldn't count it as a checksum failure. We could possibly count
it towards the skipped pages, though I'm even on the fence about that.

Thanks!

Stephen

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Banck 2018-09-18 15:53:13 Re: Progress reporting for pg_verify_checksums
Previous Message Tom Lane 2018-09-18 15:38:57 Re: pgsql: Allow concurrent-safe open() and fopen() in frontend code for Wi