Re: Online verification of checksums

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Subject: Re: Online verification of checksums
Date: 2019-03-06 02:42:21
Message-ID: CAOuzzgoMGsWx-_pJH6hLLs=_a91wa+POzyntsesnO3ajOm0MyA@mail.gmail.com
Lists: pgsql-hackers

Greetings,

On Tue, Mar 5, 2019 at 18:36 Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> On Tue, Mar 05, 2019 at 02:08:03PM +0100, Tomas Vondra wrote:
> > Based on quickly skimming that thread the main issue seems to be
> > deciding which files in the data directory are expected to have
> > checksums. Which is a valid issue, of course, but I was expecting
> > something about partial read/writes etc.
>
> I remember complaining about partial write handling as well for the
> base backup checks... There should be an email about it on the list,
> cannot find it now ;p
>
> > My understanding is that:
> >
> > (a) The checksum verification should not generate false positives (same
> > as for basebackup).
> >
> > (b) The partial reads do emit warnings, which might be considered false
> > positives I guess. Which is why I'm arguing for changing it to do the
> > same thing basebackup does, i.e. ignore this.
>
> Well, at least that's consistent... Argh, I really think that we
> ought to make the failures reported harder because that's easier to
> detect within a tool and some deployments set log_min_messages >
> WARNING so checksum failures would just be lost. For base backups we
> don't care much about that as files are just blindly copied so they
> could have torn pages, which is fine as that's fixed at replay. Now
> we are talking about a set of tools which could have reliable
> detection mechanisms for those problems.

I'm traveling, but I'll try to comment more in the coming days. In
general I agree with Tomas on these items. Also, pg_basebackup has to
handle torn pages when it comes to checksums just like the verify tool
does, and having them be consistent (along with external tools) would
really be for the best, imv. I still feel that retrying a short read
(try reading more to get the whole page) would be alright, reading
until we hit EOF and then moving on. I'm not sure it's possible, but I do
worry a bit that we might get a short read from a network file system or
something similar that isn't actually at EOF, and then we would skip a
significant remaining portion of the file. Another thought might be to
stat the file after we have opened it to see its length.
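
To make the short-read retry and the stat-after-open idea concrete, here
is a rough sketch in Python. It is not code from pg_basebackup or any
verification tool: the names read_page and scan_pages are made up for
illustration, the 8192-byte page size is assumed to be the default
BLCKSZ, and os.pread is only available on Unix-like systems. A short
read is retried until the page is complete or a read returns zero bytes,
and the length from fstat() is used to notice a page that came back
shorter than expected instead of silently skipping the rest of the file.

```python
import os

BLCKSZ = 8192  # assumed page size (PostgreSQL's default block size)


def read_page(fd, offset, size=BLCKSZ):
    """Read up to `size` bytes at `offset`, retrying after short reads.

    Returns fewer than `size` bytes only when a read returns zero bytes,
    which is how the kernel reports end of file.
    """
    chunks = []
    got = 0
    while got < size:
        chunk = os.pread(fd, size - got, offset + got)
        if not chunk:  # zero bytes back: treat as EOF
            break
        chunks.append(chunk)
        got += len(chunk)
    return b"".join(chunks)


def scan_pages(path, size=BLCKSZ):
    """Yield (block_number, data, truncated) for each page in the file.

    The file length is taken from fstat() right after opening. `truncated`
    is True when a page came back shorter than that length says it should
    be, so the caller can report it rather than quietly moving on.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        file_len = os.fstat(fd).st_size
        offset = 0
        blkno = 0
        while offset < file_len:
            expected = min(size, file_len - offset)
            data = read_page(fd, offset, size)
            yield blkno, data, len(data) < expected
            offset += size
            blkno += 1
    finally:
        os.close(fd)
```

This ignores a file that grows or shrinks after the fstat() call, which
a real tool running against a live cluster would have to think about.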

Just a few thoughts since I’m on my phone. Will try to write up something
more in a day or two.

Thanks!

Stephen
