Re: Offline enabling/disabling of data checksums

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Magnus Hagander <magnus(at)hagander(dot)net>, Sergei Kornilov <sk(at)zsrv(dot)org>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Offline enabling/disabling of data checksums
Date: 2019-03-21 07:17:32
Message-ID: alpine.DEB.2.21.1903210745310.3843@lancre
Lists: pgsql-hackers


Hello Michaël,

> On Wed, Mar 20, 2019 at 05:46:32PM +0100, Fabien COELHO wrote:
>> I think that the motivation/risks should appear before the solution. "As xyz
>> ..., ...", or at least the logical link should be outlined.
>>
>> It is not clear to me whether the following sentences, which seem specific
>> to "pg_rewind", are linked to the previous advice, which seems rather to
>> refer to streaming replication?
>
> Do you have a better idea of formulation?

I can try, but I must admit that I'm fuzzy about the actual issue. Is
there a problem with streaming replication when the checksum settings
are inconsistent, or not?

You seem to suggest that the issue is more about how some commands or
backup tools operate on a cluster.

I'll reread the thread carefully and will make a proposal.

> Imagine for example a primary-standby with checksums disabled: [...]

Yep, that's cool.

>> Should disabling in reverse order not be safe? The checksums are not
>> checked afterwards?
>
> I don't quite understand your comment about the ordering. If all the
> standbys are destroyed first, then enabling/disabling checksums happens
> at a single place.

Sure. I was suggesting that disabling on replicated clusters is possibly
safer, but I do not know the details of replication and checksumming
precisely enough to be sure about it.

>> After the reboot, some data files are not fully updated with their
>> checksums, although the control file says that they are. It should then
>> fail after a restart when a no-checksum page is loaded?
>>
>> What am I missing?
>
> Please note that we do that in other tools as well and we live fine
> with that, pg_basebackup and pg_rewind to name just two.

The fact that other commands are exposed to the same potential risk is not
a very good argument not to fix it.

> I am not saying that it is not a problem in some cases, but I am saying
> that this is not a problem that this patch should solve.

As solving the issue involves swapping two lines and setting one boolean
parameter to true, I do not see why it should not be done. Fixing the
issue takes much less time than writing about it...

And if other commands can be improved too, that is fine with me.

> If we were to do something about that, it could make sense to make
> fsync_pgdata() smarter so that the control file is flushed last there, or
> define flush strategies there.

ISTM that this would not work: the control file update can only be done
*after* the fsync, so that it describes the cluster's actual status.
Otherwise it is just a question of luck whether the cluster is corrupt
after a crash while fsyncing. The enforced order of operations, with a
barrier in between, is the important thing here.
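
To make the ordering concrete, here is a rough sketch of what I mean. It
is not the patch and does not use PostgreSQL's own helpers: the function
names, the file layout and the one-byte "enabled" flag are all made up
for illustration, and only plain POSIX calls are used. The point is only
that the status file is written after the data has been fsync'ed, never
before.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to path, then fsync before
 * returning.  Returns 0 on success, -1 on any failure.
 * A real tool would also fsync the containing directory and retry
 * on EINTR; both are left out to keep the sketch short.
 */
int
write_and_fsync(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int         fd;

    assert(path != NULL && buf != NULL);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    while (len > 0)
    {
        ssize_t n = write(fd, p, len);

        if (n <= 0)
        {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t) n;
    }

    /* The barrier: nothing below may happen until this succeeds. */
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

/*
 * Hypothetical two-step update: data first, status flag second.
 * If the first step fails, or a crash interrupts it, the control
 * file has not been touched and still describes the old state.
 */
int
enable_checksums_ordered(const char *data_path,
                         const void *page, size_t page_len,
                         const char *control_path)
{
    const char  enabled = 1;

    if (write_and_fsync(data_path, page, page_len) != 0)
        return -1;

    return write_and_fsync(control_path, &enabled, 1);
}
```

With the two steps in the opposite order, a crash between them would
leave a control file claiming checksums are enabled over data files that
were never flushed.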

--
Fabien.
