Re: Offline enabling/disabling of data checksums

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Michael Banck <michael(dot)banck(at)credativ(dot)de>, Postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Offline enabling/disabling of data checksums
Date: 2019-03-11 04:53:23
Message-ID: 20190311045323.GE1818@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 27, 2019 at 07:59:31AM +0100, Fabien COELHO wrote:
> Hallo Michael,

Okay, let's move on with these patches!

> The renaming implies quite a few changes (eg in the documentation,
> makefiles, tests) which warrants a review, so it should be a patch. Also,
> ISTM that the renaming only make sense when adding the enable/disable
> feature, so I'd say that it belongs to this patch. Opinions?

I have worked on the last v4 sent by Michael B, finishing with the
attached after review and addressed the last points raised by Fabien.
The thing is that I have been rather unhappy with a couple of things
in what was proposed, so I have finished by modifying quite a couple
of areas. The patch set is now splitted as I think is suited for
commit, with the refactoring and renaming being separated from the
actual feature:
- 0001 if a patch to refactor the routine for the control file
update. I have made it backend-aware, and we ought to be careful with
error handling, use of fds and such, something that v4 was not very
careful about.
- 0002 renames pg_verify_checksums to pg_checksums with a
straight-forward switch. Docs as well as all references to
pg_verify_checksums are updated.
- 0003 adds the new options --check, --enable and --disable, with
--check being the default as discussed.
- 0004 adds a -N/--no-sync which I think is nice for consistency with
other tools. That's also useful for the tests, and was not discussed
until now on this thread.

> Patch applies cleanly, compiles, global & local "make check" are ok.
>
> Doc: "enable, disable or verifies" -> "checks, enables or disables"?
> Spurious space: "> ." -> ">.".
>
> As checksum are now checked, the doc could use "check" instead of "verify",
> especially if there is a rename and the "verify" word disappears.

That makes sense. I have fixed these, and simplified the docs a bit
to have a more simple page.

> I'd be less terse when documenting actions, eg: "Disable checksums" ->
> "Disable checksums on cluster."

The former is fine in my opinion.

> Doc should state that checking is the default action, eg "Check checksums on
> cluster. This is the default action."

Check.

> Help string could say that -c is the default action. There is a spurious
> help line remaining from the previous "--action" implementation.

Yeah, we should. Added.

> open: I'm positively unsure about ?: priority over |, and probably not the
> only one, so I'd add parentheses around the former.

Yes, I agree that the current code is hard to decrypt. So reworked
with a separate variable in scan_file, and added extra parenthesis in
the part which updates the control file.

> I'm at odds with the "postmaster.pid" check, which would not prevent an
> issue if a cluster is started with "postmaster". I still think that the
> enabling-in-progress should be stored in the cluster state.
>
> ISTM that the cluster read/update cycle should lock somehow the control file
> being modified. However other commands do not seem to do something about it.

I am still not on board for adding more complexity in this area, at
least not for this stuff and for the purpose of this thread, because
this can happen at various degrees for various configurations for ages
and not only for checksums. Also, I think that we still have to see
users complain about that. Here are some scenarios where this can
happen:
- A base backup partially written. pg_basebackup limits this risk but
it could still be possible to see a case where partially-written data
folder. And base backups are around for many years.
- pg_rewind, and the tool is in the tree since 9.5, the tool is
actually available on github since 9.3.

> I do not think that enabling if already enabled or disabling or already
> disable should exit(1), I think it is a no-op and should simply exit(0).

We already issue exit(1) when attempting to verify checksums on a
cluster where they are disabled. So I agree with Michael B's point of
Issuing an error in such cases. I think also that this makes handling
for operators easier.

> About tests: I'd run a check on a disabled cluster to check that the command
> fails because disabled.

Makes sense. Added. We need a test also for the case of successive
runs with --enable.

Here are also some notes from my side.
- There was no need to complicate the synopsis of the docs.
- usage() included still references to --action and indentation was a
bit too long at the top.
- There were no tests for disabling checksums, so I have added some.
- We should check that the combination of --enable and -r fails.
- Tests use only long options, that's better for readability.
- Improved comments in tests.
- Better to check for "data_checksum_version > 0" if --enable is
used. That's more portable long-term if more checksum versions are
added.
- The check on postmaster.pid is actually not necessary as we already
know that the cluster has been shutdown cleanly per the state of the
control file. I agree that there could be a small race condition
here, and we could discuss that in a different thread if need be as
such things could be improved for other frontend tools as well. For
now I am taking the most simple approach.

(Still need to indent the patches before commit, but that's a nit.)
--
Michael

Attachment Content-Type Size
0001-Refactor-routine-for-update-of-control-file.patch text/x-diff 7.2 KB
0002-Rename-pg_verify_checksums-to-pg_checksums.patch text/x-diff 16.7 KB
0003-Add-options-to-enable-and-disable-checksums-in-pg_ch.patch text/x-diff 17.8 KB
0004-Add-option-N-no-sync-to-pg_checksums.patch text/x-diff 5.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Etsuro Fujita 2019-03-11 05:08:51 Re: Oddity with parallel safety test for scan/join target in grouping_planner
Previous Message Amit Langote 2019-03-11 04:30:11 Re: Should we add GUCs to allow partition pruning to be disabled?