Re: Online verification of checksums

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Michael Banck <michael(dot)banck(at)credativ(dot)de>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Subject: Re: Online verification of checksums
Date: 2018-10-30 17:22:52
Message-ID: alpine.DEB.2.21.1810301754020.9086@lancre
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hallo Michael,

Patch v6 applies cleanly, compiles, local make check is ok.

>> My current opinion is that when offline some errors are not admissible,
>> whereas the same errors are admissible when online because they may be due
>> to the ongoing database processing, so the behavior should not be strictly
>> the same.
>
> Indeed, the recently-added pg_verify_checksums testsuite

A welcome addition!

> adds a few files with just 'foo' in them and with V5 of the patch,
> pg_verify_checksums no longer bails out with an error on those.

> I have now re-added the retry logic for partially-read pages, so that it
> bails out if it reads a page partially twice. This makes the testsuite
> work again.
>
> I am not convinced we need to differentiate further between online and
> offline operation, can you explain in more detail which other
> differences are ok in online mode and why?

For instance the "file/directory was removed" do not look okay at all when
offline, even if unlikely. Moreover, the checks hides the error message
and is fully silent in this case, while it was not beforehand on the
same error when offline.

The "check if page was modified since checkpoint" does not look useful
when offline. Maybe it lacks a comment to say that this cannot (should not
?) happen when offline, but even then I would not like it to be true: ISTM
that no page should be allowed to be skipped on the checkpoint condition
when offline, but it is probably ok to skip with the new page test, which
make me still think that they should be counted and reported separately,
or at least the checkpoint skip test should not be run when offline.

When offline, the retry logic does not make much sense, it should complain
directly on the first error? Also, I'm unsure of the read & checksum retry
logic *without any delay*.

>> This might suggest some option to tell the command that it should work in
>> online or offline mode, so that it may be stricter in some cases. The
>> default may be one of the option, eg the stricter offline mode, or maybe
>> guessed at startup.
>
> If we believe the operation should be different, the patch removes the
> "is cluster online?" check (as it is no longer necessary), so we could
> just replace the current error message with a global variable with the
> result of that check and use it where needed (if any).

That could let open the issue of someone starting the check offline, and
then starting the database while it is not finished. Maybe it is not worth
sweating about such a narrow use case.

If operations are to be different, and it seems to me they should be, I'd
suggest (1) auto detect default based one the existing "is cluster online"
code, (2) force options, eg --online vs --offline, which would complain
and exit if the cluster is not in the right state on startup.

I'd suggest to add a failing checksum online test, if possible. At least a
"foo" file? It would also be nice if the test could apply on an active
database, eg with a low-rate pgbench running in parallel to the
verification, but I'm not sure how easy it is to add such a thing.

--
Fabien.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2018-10-30 17:52:54 Re: replication_slots usability issue
Previous Message David G. Johnston 2018-10-30 16:14:17 Re: Constraint documentation