Re: Online verification of checksums

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online verification of checksums
Date: 2019-03-02 22:00:31
Message-ID: 20190302220031.j7ayfoimgr42ofij@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2019-03-02 22:49:33 +0100, Tomas Vondra wrote:
>
>
> On 3/2/19 5:08 PM, Stephen Frost wrote:
> > Greetings,
> >
> > * Michael Banck (michael(dot)banck(at)credativ(dot)de) wrote:
> >> On Friday, 01.03.2019 at 18:03 -0500, Robert Haas wrote:
> >>> On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> >>> <michael(dot)banck(at)credativ(dot)de> wrote:
> >>>> I have added a retry for this as well now, without a pg_sleep().
> >>>> This catches around 80% of the half-reads, but a few slip through. At
> >>>> that point we bail out with exit(1), and the user can try again, which I
> >>>> think is fine?
> >>>
> >>> Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> >>> robust at all.
> >>
> >> The chance that pg_verify_checksums hits a torn page (at least in my
> >> tests, see below) is already pretty low, a couple of times per 1000
> >> runs. Maybe 4 out of 5 times, the page is read fine on retry and we
> >> march on. Otherwise, we now just issue a warning and skip the file (or
> >> so was the idea, see below); do you think that is not acceptable?
> >>
> >> I re-ran the tests (concurrent createdb/pgbench -i -s 50/dropdb and
> >> pg_verify_checksums in tight loops) with the current patch version, and
> >> I am seeing short reads very, very rarely (maybe every 1000th run) with
> >> a warning like:
> >>
> >> |1174
> >> |pg_verify_checksums: warning: could not read block 374 in file "data/base/18032/18045": read 4096 of 8192
> >> |pg_verify_checksums: warning: could not read block 375 in file "data/base/18032/18045": read 4096 of 8192
> >> |Files skipped: 2
> >>
> >> The 1174 is the sequence number; the first 1173 runs of
> >> pg_verify_checksums only skipped blocks.
> >>
> >> However, the fact that it shows two warnings for the same file means
> >> there is something wrong here. It was continuing to the next block while
> >> I think it should just skip to the next file on read failures. So I have
> >> changed that now; new patch attached.
> >
> > I'm confused - if previously it was continuing to the next block instead
> > of doing the re-read on the same block, why don't we just change it to
> > do the re-read on the same block properly and see if that fixes the
> > retry, instead of just giving up and skipping? I'm not necessarily
> > against skipping to the next file, to be clear, but I think I'd be
> > happier if we kept reading the file until we actually get EOF.
> >
> > (I've not looked at the actual patch, just read what you wrote..)
> >
>
> Notice that those two errors are actually for two consecutive blocks in
> the same file. So what probably happened is that postgres started to
> extend the relation, and the verification tried to read the new last page
> after the kernel had added just the first 4kB filesystem page. Then it
> probably succeeded on a retry, and then the same thing happened on the
> next page.
>
> I don't think EOF addresses this, though - the partial read happens
> before we actually reach the end of the file.
>
> And re-reads are not a solution either, because the second read may
> still see only the first half, and then what - is it a permanent issue
> (in which case it's data corruption), or an extension in progress?
>
> I wonder if we can simply ignore those errors entirely if it's the last
> page in the segment? We can't really check that the file is "complete"
> anyway - e.g. if a table has multiple segments and a "middle" one is a
> page shorter, we'll happily ignore that during verification.
>
> Also, what if we're reading a file and it gets truncated (e.g. after
> vacuum notices the last few pages are empty)? Doesn't that have the same
> issue?
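
For concreteness, here is a rough sketch of the client-side heuristic being
discussed: retry a short read once, skip the rest of the file if a block
still cannot be read in full, but tolerate a persistent short read when it
hits what is currently the last page of the segment (a likely extension in
progress). The function and helper names (scan_segment,
verify_page_checksum) are made up for this sketch; it is not the actual
pg_verify_checksums patch.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Returns 0 if the segment was fully verified, 1 if it had to be skipped. */
static int
scan_segment(const char *path)
{
    char        buf[BLCKSZ];
    long        blockno = 0;
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
    {
        fprintf(stderr, "could not open \"%s\": %s\n", path, strerror(errno));
        return 1;
    }

    for (;;)
    {
        ssize_t     r = read(fd, buf, BLCKSZ);

        if (r < 0)
        {
            fprintf(stderr, "could not read \"%s\": %s\n", path, strerror(errno));
            close(fd);
            return 1;
        }
        if (r == 0)
            break;              /* clean EOF, all blocks read in full */

        if (r != BLCKSZ)
        {
            struct stat st;

            /* Retry once: the kernel may expose a half-written page. */
            if (lseek(fd, (off_t) blockno * BLCKSZ, SEEK_SET) >= 0)
                r = read(fd, buf, BLCKSZ);

            if (r != BLCKSZ)
            {
                /*
                 * Still short.  If this is the current last page of the
                 * segment, assume an extension in progress and stop here;
                 * otherwise warn and skip the remainder of this file.
                 */
                if (fstat(fd, &st) == 0 &&
                    (off_t) blockno * BLCKSZ + BLCKSZ >= st.st_size)
                    break;

                fprintf(stderr, "short read of block %ld in \"%s\", skipping file\n",
                        blockno, path);
                close(fd);
                return 1;
            }
        }

        /* verify_page_checksum(buf, blockno) would run here */
        blockno++;
    }

    close(fd);
    return 0;
}

Whether a heuristic like this is acceptable at all, rather than involving
the server, is exactly the question below.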

I gotta say, my conclusion from this debate is that it's simply a
mistake to do this without involvement of the server, which can use
locking to prevent these kinds of issues. It seems pretty absurd to me
to have hacky workarounds for partial writes on a live server, for
truncation, etc., even though the server has ways to deal with that.

- Andres
