From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Online verification of checksums |
Date: | 2019-03-06 17:33:49 |
Message-ID: | CA+TgmoY3xhRD1UV5RJXDeiz=HSwmy7i8o47vKMzM0xFwPVg5iQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Sat, Mar 2, 2019 at 5:45 AM Michael Banck <michael(dot)banck(at)credativ(dot)de> wrote:
> Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas:
> > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> > <michael(dot)banck(at)credativ(dot)de> wrote:
> > > I have added a retry for this as well now, without a pg_sleep() as well.
> > > This catches around 80% of the half-reads, but a few slip through. At
> > > that point we bail out with exit(1), and the user can try again, which I
> > > think is fine?
> >
> > Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> > robust at all.
>
> The chance that pg_verify_checksums hits a torn page (at least in my
> tests, see below) is already pretty low, a couple of times per 1000
> runs. Maybe 4 out 5 times, the page is read fine on retry and we march
> on. Otherwise, we now just issue a warning and skip the file (or so was
> the idea, see below), do you think that is not acceptable?
Yeah. Consider a paranoid customer with 100 clusters who runs this
every day on every cluster. They're going to see failures every day
or three and go ballistic.
I suspect that better retry logic might help here. I mean, I would
guess that 10 retries at 1 second intervals or something of that sort
would be enough to virtually eliminate false positives while still
allowing us to report persistent -- and thus real -- problems. But if
even that is going to produce false positives with any measurable
probability different from zero, then I think we have a problem,
because I neither like a verification tool that ignores possible signs
of trouble nor one that "cries wolf" when things are fine.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Haas | 2019-03-06 17:35:07 | Re: patch to allow disable of WAL recycling |
Previous Message | Robert Haas | 2019-03-06 17:26:58 | Re: Online verification of checksums |