Re: Online verification of checksums

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Michael Banck <michael(dot)banck(at)credativ(dot)de>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Subject: Re: Online verification of checksums
Date: 2019-03-18 08:11:29
Message-ID: CAOuzzgr26WY3izp8_svYHVZp0ZUmyp5WiuJB7u_2V5PHHcheHw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Greetings,

On Mon, Mar 18, 2019 at 15:52 Michael Banck <michael(dot)banck(at)credativ(dot)de>
wrote:

> Hi.
>
> Am Montag, den 18.03.2019, 03:34 -0400 schrieb Stephen Frost:
> > * Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
> > > Am Montag, den 18.03.2019, 02:38 -0400 schrieb Stephen Frost:
> > > > * Michael Paquier (michael(at)paquier(dot)xyz) wrote:
> > > > > On Mon, Mar 18, 2019 at 01:43:08AM -0400, Stephen Frost wrote:
> > > > > > To be clear, I agree completely that we don't want to be
> reporting false
> > > > > > positives or "this might mean corruption!" to users running the
> tool,
> > > > > > but I haven't seen a good explaination of why this needs to
> involve the
> > > > > > server to avoid that happening. If someone would like to point
> that out
> > > > > > to me, I'd be happy to go read about it and try to understand.
> > > > >
> > > > > The mentions on this thread that the server has all the facility in
> > > > > place to properly lock a buffer and make sure that a partial read
> > > > > *never* happens and that we *never* have any kind of false
> positives,
> > > >
> > > > Uh, we are, of course, going to have partial reads- we just need to
> > > > handle them appropriately, and that's not hard to do in a way that we
> > > > never have false positives.
> > >
> > > I think the current patch (V13 from
> https://www.postgresql.org/message-i
> > > d/1552045881(dot)4947(dot)43(dot)camel(at)credativ(dot)de) does that, modulo possible
> bugs.
> >
> > I think the question here is- do you ever see false positives with this
> > latest version..? If you are, then that's an issue and we should
> > discuss and try to figure out what's happening. If you aren't seeing
> > false positives, then it seems like we're done here, right?
>
> What do you mean with false positives here? I've never seen a bogus
> checksum failure, i.e. pg_checksums claiming some checksum is wrong
> cause it only read half of a block or a torn page.
>
> I do see sporadic partial reads and they get treated by the re-check
> logic and (if that is not enough) get tallied up as a skipped block in
> the end. Is that a false positive in your book?

No, that’s clearer not a false positive.

[...]
>
> > > I have now rebased that patch on top of the pg_verify_checksums ->
> > > pg_checksums renaming, see attached.
> >
> > Thanks for that. Reading through the code though, I don't entirely
> > understand why we're making things complicated for ourselves by trying
> > to seek and re-read the entire block, specifically this:
>
> [...]
>
> > I would think that we could just do:
> >
> > insert_location = 0;
> > r = read(BLCKSIZE - insert_location);
> > if (r < 0) error();
> > if (r == 0) EOF detected, move to next
> > if (r < (BLCKSIZE - insert_location)) {
> > insert_location += r;
> > continue;
> > }
> >
> > At this point, we should have a full block, do our checks...
>
> Well, we need to read() into some buffer which you have ommitted.

Surely there’s a buffer the read in the existing code is passing in, you
just need to offset by the current pointer, sorry for not being clear.

In other words the read would look more like:

read(fd,buf + insert_ptr, BUFSZ - insert_ptr)

And then you have to reset insert_ptr once you have a full block.

So if we had a short read, and then read the rest of the block via
> (BLCKSIZE - insert_location) wouldn't we have to read that in a second
> buffer and then join the two in order to compute the checksum? That
> does not sounds simpler to me than just re-reading the block entirely.

No, just read into your existing buffer at the point where the prior
partial read left off...

> Have you seen cases where the kernel will actually return a partial read
> > for something that isn't at the end of the file, and where you could
> > actually lseek past that point and read the next block? I'd be really
> > curious to see that if you can reproduce it... I've definitely seen
> > empty pages come back with a claim that the full amount was read, but
> > that's a very different thing.
>
> Well, I've seen partial reads and I have seen very rarely that it will
> continue to read another block afterwards. If the relation is being
> extended while we check it, it sounds plausible that another block could
> be written before we get to read EOF on the next read() after a partial
> read() so that does not sounds like a bug to me either.

Right, absolutely you can have a partial read during a relation extension
and then come back around and do another read and discover more data,
that’s entirely reasonable and I’ve seen it happen too.

I might be misunderstanding your question though?

Yes, the question was more like this: have you ever seen a read return a
partial result when you know you’re in the middle somewhere of an existing
file and the length of the file hasn’t been changed by something else..? I
can’t say that I have, when reading from regular files, even in
kernel-error type of conditions due to hardware issues, but I’m open to
being told I’m wrong... in such a case though I would still expect an
error on a subsequent read, which would work just fine for our case. If the
kernel just decides to return a zero in that case then I don’t know that
there’s really anything we can do about that because that seems like it
would be pretty clearly broken results from the kernel and that’s out of
scope for this.

Apologies if this isn’t clear, on my phone now.

Thanks!

Stephen

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2019-03-18 08:13:01 Re: Offline enabling/disabling of data checksums
Previous Message Kyotaro HORIGUCHI 2019-03-18 08:00:24 Re: [PATCH] get rid of StdRdOptions, use individual binary reloptions representation for each relation kind instead