Re: Offline enabling/disabling of data checksums

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Michael Banck <michael(dot)banck(at)credativ(dot)de>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Sergei Kornilov <sk(at)zsrv(dot)org>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Offline enabling/disabling of data checksums
Date: 2019-03-14 04:39:36
Message-ID: 20190314043936.GH3493@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 13, 2019 at 08:56:33PM +0900, Michael Paquier wrote:
> On Wed, Mar 13, 2019 at 12:09:24PM +0100, Michael Banck wrote:
> > The attached patch should do the above, on top of Michael's last
> > patchset.
>
> What you are doing here looks like a good defense in itself.

More thoughts on that, and here is a short summary of the thread.

+ /* Check if control file has changed */
+ if (controlfile_last_updated != ControlFile->time)
+ {
+ fprintf(stderr, _("%s: control file has changed since startup\n"), progname);
+ exit(1);
+ }
Actually, under the conditions discussed on this thread that Postgres
is started in parallel of pg_checksums, imagine the following
scenario:
- pg_checksums starts to enable checksums, it reads a block to
calculate its checksum, then calculates it.
- Postgres has been started in parallel, writes the same block.
- pg_checksums finishes the block calculation, writes back the block
it has just read.
- Postgres stops, some data is lost.
- At the end of pg_checksums, we complain that the control file has
been updated since the start of pg_checksums.
I think that we should be way more noisy about this error message
document properly that Postgres should not be started while checksums
are enabled. Basically, I guess that it should mention that there is
a risk of corruption because of this parallel operation.

Hence, based on what I could read on this thread, we'd like to have
the following things added to the core patch set:
1) When enabling checksums, fsync the data folder. Then update the
control file, and finally fsync the control file (+ flush of the
parent folder to make the whole durable). This way a host crash never
actually means that we finish in an inconsistent state.
2) When checksums are disabled, update the control file, then fsync
it + its parent folder.
3) Add tracking of the control file data, and complain loudly before
trying to update the file if there are any inconsistencies found.
4) Document with a big-fat-red warning that postgres should not be
worked on while the tool is enabling or disabling checksums.

There is a part discussed about standbys and primaries with not the
same checksum settings, but I commented on that in [1].

There is a secondary bug fix to prevent the tool to run if the data
folder has been initialized with a block size different than what
pg_checksums has been compiled with in [2]. The patch looks good,
still the error message could be better per my lookup.

[1]: https://www.postgresql.org/message-id/20190314002342.GC3493@paquier.xyz
[2]: https://www.postgresql.org/message-id/20190313224742.GA3493@paquier.xyz

Am I missing something?
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2019-03-14 04:41:27 Re: pgsql: Add support for hyperbolic functions, as well as log10().
Previous Message Takuma Hoshiai 2019-03-14 04:37:00 Proposal to suppress errors thrown by to_reg*()