Re: Offline enabling/disabling of data checksums

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Magnus Hagander <magnus(at)hagander(dot)net>, Sergei Kornilov <sk(at)zsrv(dot)org>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Offline enabling/disabling of data checksums
Date: 2019-03-22 00:27:27
Message-ID: 20190322002727.GJ20192@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 21, 2019 at 08:17:32AM +0100, Fabien COELHO wrote:
> I can try, but I must admit that I'm fuzzy about the actual issue. Is there
> a problem on a streaming replication with inconsistent checksum settings, or
> not?
>
> You seem to suggest that the issue is more about how some commands or backup
> tools operate on a cluster.

Yes. That's what I am writing about. Imagine for example this case
with pg_rewind:
- primary has checksums enabled.
- standby has checksums disabled.
- a hard crash of the primary happens, there is a failover to the
standby which gets promoted.
- The primary's host is restarted, it is started and stopped once
cleanly to have a clear point in its past timeline where WAL forked
thanks to the generation of at least the shutdown checkpoint generated
by the clean stop.
- pg_rewind is run, copying some pages from the promoted standby,
which don't have checksums, to the primary with checksums enabled, and
causing some pages to have an incorrect checksum.

There is another tool I know of which is called pg_rman, which is a
backup tool able to take incremental backups in the shape of a delta
of relation blocks. Then imagine the following:
- One instance of Postgres runs, has checksums disabled.
- pg_rman takes a full backup of it.
- Checksums are enabled on this instance.
- An incremental backup from the previous full backup point is taken.
If I recall correctly pg_rman takes a copy of the new control file as
well, which tracks checksums as being enabled.
- A crash happens, the data folder is dead.
- Rollback to the previous backup is done, and we restore up to a
point after the incremental backup.
- And you finish with a cluster which has checksums enabled, but as
the initial full backup had checksums disabled, not all the pages may
be in a correct state.

So I think that it makes sense to tell to be careful within the
documentation, but being too much careful in the tool discards also
many possibilities (see the example of the clean failover where it is
possible to enable checksums with no actual downtime). And this part
has a lot of value.

> ISTM that this would not work: The control file update can only be done
> *after* the fsync to describe the cluster actual status, otherwise it is
> just a question of luck whether the cluster is corrupt on an crash while
> fsyncing. The enforced order of operation, with a barrier in between, is the
> important thing here.

Done the switch for this case. For pg_rewind actually I think that
this is an area where its logic could be improved a bit. So first
the data folder is synced, and then the control file is updated. It
took less time to change the code than to write this paragraph,
including the code compilation and one run of the TAP tests,
confirmed.

I have added in the docs a warning about a host crash while doing the
operation, with a recommendation to check the state of the checksums
on the data folder should it happen, and the previous portion of the
docs about clusters. Your suggestion sounds adapted. I would be
tempted to add a bigger warning in pg_rewind or pg_basebackup about
that, but that's a different story for another time.

Does that look fine to you?
--
Michael

Attachment Content-Type Size
v9-0001-Add-options-to-enable-and-disable-checksums-in-pg.patch text/x-diff 19.3 KB
v9-0002-Add-option-N-no-sync-to-pg_checksums.patch text/x-diff 6.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2019-03-22 00:36:11 Re: PostgreSQL pollutes the file system
Previous Message Amit Langote 2019-03-22 00:21:56 Re: ToDo: show size of partitioned table