Re: Checksums by default?

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Checksums by default?
Date: 2017-01-24 00:14:02
Message-ID: CAM3SWZTc+4QjysO-Op4ui8hZJro3QG0fN1MFj1NtMuVHQY1sew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jan 21, 2017 at 9:09 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Not at all; I just think that it's not clear that they are a net win
> for the average user, and so I'm unconvinced that turning them on by
> default is a good idea. I could be convinced otherwise by suitable
> evidence. What I'm objecting to is turning them on without making
> any effort to collect such evidence.

+1

One insight Jim Gray has in the classic paper "Why Do Computers Stop
and What Can Be Done About It?" [1] is that fault-tolerant hardware is
table stakes, and so most failures are related to operator error, and
to a lesser extent software bugs. The paper is about 30 years old.

I don't recall ever seeing a checksum failure on a Heroku Postgres
database, even though they were enabled as soon as the feature became
available. I have seen a few corruption problems brought to light by
amcheck, though, all of which were due to bugs in software.
Apparently, before I joined Heroku there were real reliability
problems with the storage subsystem that Heroku Postgres runs on (it's
a pluggable storage service from a popular cloud provider -- the
"pluggable" functionality would have made it fairly novel at the
time). These problems were something that the Heroku Postgres team
dealt with about 6 years ago. However, anecdotal evidence suggests
that the reliability of the same storage system *vastly* improved
roughly a year or two later. We still occasionally lose drives, but
drives seem to fail fast in a fashion that lets us recover without
data loss easily. In practice, Postgres checksums do *not* seem to
catch problems. That's been my experience, at least.

Obviously every additional check helps, and it may be something we can
do without any appreciable downside. I'd like to see a benchmark.

[1] http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf
--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jia Yu 2017-01-24 00:19:22 Re: IndexBuild Function call fcinfo cannot access memory
Previous Message Jim Nasby 2017-01-24 00:12:55 Re: GSoC 2017