Re: Online verification of checksums

From: Michael Banck <michael(dot)banck(at)credativ(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online verification of checksums
Date: 2019-02-05 11:29:53
Message-ID: 1549366193.796.9.camel@credativ.de
Lists: pgsql-hackers

Hi,

On Tuesday, 05.02.2019 at 11:30 +0100, Tomas Vondra wrote:
> On 2/5/19 8:01 AM, Andres Freund wrote:
> > On 2019-02-05 06:57:06 +0100, Fabien COELHO wrote:
> > > > > > I'm wondering (possibly again) about the existing early exit if one block
> > > > > > cannot be read on retry: the command should count this as a kind of bad
> > > > > > block, proceed on checking other files, and obviously fail in the end, but
> > > > > > having checked everything else and generated a report. I do not think that
> > > > > > this condition warrants a full stop. ISTM that under rare race conditions
> > > > > > (eg, an unlucky concurrent "drop database" or "drop table") this could
> > > > > > happen when online, although I could not trigger one despite heavy testing,
> > > > > > so I'm possibly mistaken.
> > > > >
> > > > > This seems like a defensible judgement call either way.
> > > >
> > > > Right now we have a few tests that explicitly check that
> > > > pg_verify_checksums fails on broken data ("foo" in the file). Those
> > > > would then just get skipped AFAICT, which I think is the worse
> > > > behaviour, but if everybody thinks that should be the way to go, we can
> > > > drop/adjust those tests and make pg_verify_checksums skip them.
> > > >
> > > > Thoughts?
> > >
> > > My point is that it should fail as it does, only not immediately (early
> > > exit), but after having checked everything else. This means avoiding calling
> > > "exit(1)" here and there (lseek, fopen...), but taking note that something
> > > bad happened, and calling exit only at the end.
> >
> > I can see both as being valuable (one gives you a more complete picture,
> > the other a quicker answer in scripts). For me that's the point where
> > it's the prerogative of the author to make that choice.

Personally, I would prefer to keep it as simple as possible for now and
get this patch committed. In my opinion the tool already behaves this
way (early exit on corrupt files), so I don't think the online
verification patch should change this.

If we see complaints about this, then I'd be happy to change it
afterwards.

> Why not make this configurable, using a command-line option?

I like this even less: this tool is about verifying checksums, so
adding options on what to do when it encounters broken pages looks
out of scope to me. Unless we want to say it should generally abort on
the first issue (i.e. on wrong checksums as well).
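
To make the two behaviours under discussion concrete, here is a minimal
C sketch. It is not taken from pg_verify_checksums: scan_files,
scan_result, exit_status, demo_check and the stop_on_first_error flag
are all hypothetical names, and demo_check merely pretends that any path
containing "bad" cannot be read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-file check: returns false if the file cannot be verified. */
typedef bool (*check_fn)(const char *path);

typedef struct
{
    int files_checked;   /* files we attempted to check */
    int files_failed;    /* files whose check reported a failure */
} scan_result;

/* Stand-in check for illustration only: paths containing "bad" fail. */
bool
demo_check(const char *path)
{
    return strstr(path, "bad") == NULL;
}

/*
 * Walk all paths. With stop_on_first_error the loop ends at the first
 * failure (early exit); otherwise the failure is only recorded and the
 * remaining files are still checked.
 */
scan_result
scan_files(const char *const *paths, size_t npaths,
           check_fn check, bool stop_on_first_error)
{
    scan_result res = {0, 0};

    assert(check != NULL);

    for (size_t i = 0; i < npaths; i++)
    {
        res.files_checked++;
        if (!check(paths[i]))
        {
            res.files_failed++;
            fprintf(stderr, "could not check file \"%s\"\n", paths[i]);
            if (stop_on_first_error)
                break;
        }
    }
    return res;
}

/* In both modes the process would still end with a non-zero status. */
int
exit_status(scan_result res)
{
    return res.files_failed > 0 ? 1 : 0;
}
```

Either way the run is reported as failed; the only difference is whether
the files after the first failure get looked at before the exit status
is decided.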

Michael

--
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de

credativ GmbH, HRB Mönchengladbach 12080
VAT ID number: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Management: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to the
following provisions: https://www.credativ.de/datenschutz
