Re: Online enabling of checksums

From: Andres Freund <andres(at)anarazel(dot)de>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Peter Geoghegan <pg(at)bowt(dot)ie>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Online enabling of checksums
Date: 2018-04-06 23:51:26
Message-ID: 20180406235126.d4sg4dtgicdpucnj@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2018-04-07 01:27:13 +0200, Daniel Gustafsson wrote:
> > On 07 Apr 2018, at 01:13, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > On 2018-04-07 01:04:50 +0200, Daniel Gustafsson wrote:
> >>> I'm fairly certain that the bug here is a simple race condition in the
> >>> test (not the main code!):
> >>
> >> I wonder if it may perhaps be a case of both?
> >
> > See my other message about the atomic fallback bit.
>
> Yep, my MUA pulled it down just as I had sent this. Thanks for confirming my
> suspicion.

But note it fails because the code using it is WRONG. There's a reason
there's "unlocked" in the name. But even leaving that aside, it probably
*still* be wrong if it were locked.

It seems *extremely* dubious that we'll allow to re-enable the checksums
while a worker is still doing stuff for the old cycle in the
background. Consider what happens if the checksum helper is currently
doing RequestCheckpoint() (something that can certainly take a *LONG*)
while. Another process disables checksums. Pages get written out
without checksums. Yet another process re-enables checksums. Helper
process does SetDataChecksumsOn(). Which succeeds because

if (ControlFile->data_checksum_version != PG_DATA_CHECKSUM_INPROGRESS_VERSION)
{
LWLockRelease(ControlFileLock);
elog(ERROR, "Checksums not in inprogress mode");
}

succeeds. Boom. Cluster with partially set checksums but marked as
valid.

Greetings,

Andres Freund

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2018-04-06 23:52:09 Re: [PATCH] Update README for Resource Owners
Previous Message Chapman Flack 2018-04-06 23:46:25 [PATCH] Update README for Resource Owners