Re: Offline enabling/disabling of data checksums

From: Michael Banck <michael(dot)banck(at)credativ(dot)de>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Offline enabling/disabling of data checksums
Date: 2019-01-08 11:05:43
Message-ID: 1546945543.32387.8.camel@credativ.de
Lists: pgsql-hackers

Hi,

On Thursday, 27.12.2018, at 12:26 +0100, Fabien COELHO wrote:
> > > For enable/disable, while the command is running, it should mark the
> > > cluster as opened to prevent an unwanted database start. I do not see
> > > where this is done.
> > >
> > > You have pretty much the same class of problems if you attempt to
> > > start a cluster on which pg_rewind or the existing pg_verify_checksums
> > > is run after these have scanned the control file to make sure that
> > > they work on a cleanly-stopped instance. [...]
> >
> > I think it comes down to what the outcome is. If we're going to end up with
> > a corrupt database (e.g. one where checksums aren't set everywhere but they
> > are marked as such in pg_control) then it's not acceptable. If the only
> > outcome is the tool gives an error that's not an error and if re-run it's
> > fine, then it's a different story.
>
> ISTM that such an outcome is indeed a risk, as a starting postgres could
> update already checksummed pages without putting a checksum. It could be
> even worse, although with a (very) low probability, with updates
> overwritten on a race condition between the processes. In any case, no
> error would be reported before much later, with invalid checksums or
> inconsistent data, or undetected forgotten committed data.

One difference between pg_rewind and pg_checksums is that the latter
potentially runs for much longer (or at least for a non-trivial amount
of time, compared to pg_rewind), so the chance of another DBA saying
"oh, that DB is down, let me start it again" might be much higher.

The question is how to do this reliably and in an acceptable way. Just
faking a postmaster.pid sounds pretty hackish to me. Do you have any
suggestions here?

The alternative would be to document that the database must not be
started while checksums are being enabled, which leaves room for pilot
error.
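
To make the "mark the cluster as busy" idea a bit more concrete, here is
a minimal sketch in Python of holding an exclusively created marker file
for the duration of an offline run. This is only an illustration, not
what the patch or PostgreSQL does: the function name exclusive_marker and
the file name offline_tool.lock are made up, and a starting postmaster
would not look at such a file, so it would only protect against other
cooperating tools. Faking postmaster.pid itself would additionally have
to mimic the real file's contents, which the sketch does not attempt.

```python
import os
from contextlib import contextmanager


@contextmanager
def exclusive_marker(datadir, name="offline_tool.lock"):
    """Hold a marker file in datadir while an offline tool runs.

    Both the function and the default file name are hypothetical.
    """
    path = os.path.join(datadir, name)
    # O_EXCL makes the creation fail with FileExistsError if the marker
    # already exists, so two runs cannot both believe they own the cluster.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    with os.fdopen(fd, "w") as f:
        # Record our PID so a human can see who holds the marker.
        f.write(f"{os.getpid()}\n")
    try:
        yield path
    finally:
        # Remove the marker on normal exit and on errors; a hard crash
        # would leave it behind and require manual cleanup.
        os.unlink(path)


# Usage (the data directory path is a placeholder):
# with exclusive_marker("/path/to/datadir"):
#     ...  # scan and rewrite the relation files here
```

The obvious weakness is the stale marker left behind after a crash, which
is one reason the approach feels hackish.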

Michael

--
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de

credativ GmbH, HRB Mönchengladbach 12080
VAT ID number: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Management: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to the
following provisions: https://www.credativ.de/datenschutz
