Re: Offline enabling/disabling of data checksums

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Michael Banck <michael(dot)banck(at)credativ(dot)de>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Sergei Kornilov <sk(at)zsrv(dot)org>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Offline enabling/disabling of data checksums
Date: 2019-03-14 14:26:53
Message-ID: CABUevExjCOKDRX=B6fBkoYs7Q2jYJLZS4Bxh1pi0b2_BV-7DEA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 14, 2019 at 5:39 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> On Wed, Mar 13, 2019 at 08:56:33PM +0900, Michael Paquier wrote:
> > On Wed, Mar 13, 2019 at 12:09:24PM +0100, Michael Banck wrote:
> > > The attached patch should do the above, on top of Michael's last
> > > patchset.
> >
> > What you are doing here looks like a good defense in itself.
>
> More thoughts on that, and here is a short summary of the thread.
>
> + /* Check if control file has changed */
> + if (controlfile_last_updated != ControlFile->time)
> + {
> + fprintf(stderr, _("%s: control file has changed since
> startup\n"), progname);
> + exit(1);
> + }
> Actually, under the conditions discussed on this thread that Postgres
> is started in parallel of pg_checksums, imagine the following
> scenario:
> - pg_checksums starts to enable checksums, it reads a block to
> calculate its checksum, then calculates it.
> - Postgres has been started in parallel, writes the same block.
> - pg_checksums finishes the block calculation, writes back the block
> it has just read.
> - Postgres stops, some data is lost.
> - At the end of pg_checksums, we complain that the control file has
> been updated since the start of pg_checksums.
> I think that we should be way more noisy about this error message
> document properly that Postgres should not be started while checksums
> are enabled. Basically, I guess that it should mention that there is
> a risk of corruption because of this parallel operation.
>
> Hence, based on what I could read on this thread, we'd like to have
> the following things added to the core patch set:
> 1) When enabling checksums, fsync the data folder. Then update the
> control file, and finally fsync the control file (+ flush of the
> parent folder to make the whole durable). This way a host crash never
> actually means that we finish in an inconsistent state.
> 2) When checksums are disabled, update the control file, then fsync
> it + its parent folder.
> 3) Add tracking of the control file data, and complain loudly before
> trying to update the file if there are any inconsistencies found.
> 4) Document with a big-fat-red warning that postgres should not be
> worked on while the tool is enabling or disabling checksums.
>

Given that the failure is data corruption, I don't think big fat warning is
enough. We should really make it impossible to start up the postmaster by
mistake during the checksum generation. People don't read the documentation
until it's too late. And it might not even be under their control - some
automated tool might go in and try to start postgres, and boom, corruption.

One big-hammer method could be similar to what pg_upgrade does --
temporarily rename away the controlfile so postgresql can't start, and when
done, put it back.

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Christoph Berg 2019-03-14 14:28:15 Re: Offline enabling/disabling of data checksums
Previous Message Magnus Hagander 2019-03-14 14:23:59 Re: Offline enabling/disabling of data checksums