Re: Checksums by default?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Checksums by default?
Date: 2017-02-11 16:06:59
Message-ID: CA+TgmoZRG_Vik+giTBAOgRCPtkb5tC0AOauoyGc8=Kpjqdhvgg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 10, 2017 at 7:38 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> Incidentally, I've been dealing with a checksum failure reported by a
> customer last week, and based on the experience I tend to agree that we
> don't have the tools needed to deal with checksum failures. I think such
> tooling should be a 'must have' for enabling checksums by default.
>
> In this particular case the checksum failure is particularly annoying
> because it happens during recovery (on a standby, after a restart), during
> startup, so FATAL means shutdown.
>
> I've managed to inspect the page in a different way (dd and pageinspect from
> another instance), and it looks fine - no obvious data corruption, the only
> thing that seems borked is the checksum itself, and only three consecutive
> bits are flipped in the checksum. So this doesn't seem like a "stale
> checksum" - a hardware issue is a possibility (the machine has ECC RAM
> though), but it might just as easily be a bug in PostgreSQL, where something
> scribbles over the checksum due to a buffer overflow, just before we write
> the buffer to the OS. So 'false failures' are not entirely impossible.
>
> And no, backups may not be a suitable solution - the failure happens on a
> standby, and the page (luckily) is not corrupted on the master. Which means
> that perhaps the standby got corrupted by WAL, which would affect the
> backups too. I can't verify this, though, because the WAL has already been
> removed from the archive. But it's a possibility.
>
> So I think we're not ready to enable checksums by default for everyone, not
> until we can provide tools to deal with failures like this (I don't think
> users will be amused if we tell them to use 'dd' and inspect the pages in a
> hex editor).
>
> ISTM the way forward is to keep the current default (disabled), but to allow
> enabling checksums on the fly. That will mostly fix the issue for people who
> actually want checksums but don't realize they need to enable them at initdb
> time (and starting from scratch is not an option for them), are running on
> good hardware and are capable of dealing with checksum errors if needed,
> even without more built-in tooling.
>
> Being able to disable checksums on the fly is nice, but it only really
> solves the issue of extra overhead - it doesn't really help with the failures
> (particularly when you can't even start the database, because of a checksum
> failure in the startup phase).
>
> So, shall we discuss what tooling would be useful / desirable?

FWIW, I appreciate this analysis and I think it's exactly the kind of
thing we need to set a strategy for moving forward.
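
To illustrate the kind of check described in the quoted text (a stored
checksum that differs from the expected one in only three consecutive
bits), here is a minimal sketch of what a small inspection helper might
look like. It is not an existing PostgreSQL tool. It assumes a raw page
image obtained some other way (for example with dd), a little-endian
machine, and that the expected checksum is supplied by the caller; it does
not recompute the checksum. The function names are hypothetical.

```python
# Sketch only: compares a page's stored checksum with an expected value
# supplied by the caller. It does NOT recompute the PostgreSQL checksum.

# In the page header, pd_lsn occupies bytes 0-7 and the 16-bit
# pd_checksum field follows it.
PD_CHECKSUM_OFFSET = 8


def stored_checksum(page, byteorder="little"):
    """Return the 16-bit pd_checksum field from a raw page image.

    byteorder="little" is an assumption (typical x86 hardware).
    """
    field = page[PD_CHECKSUM_OFFSET:PD_CHECKSUM_OFFSET + 2]
    return int.from_bytes(field, byteorder)


def flipped_bits(stored, expected):
    """Return the bit positions (0 = least significant) that differ."""
    diff = (stored ^ expected) & 0xFFFF
    return [bit for bit in range(16) if (diff >> bit) & 1]


def is_consecutive_run(bits):
    """True if the differing bits form one unbroken run."""
    return bool(bits) and bits[-1] - bits[0] + 1 == len(bits)


if __name__ == "__main__":
    # Synthetic 8 kB page: zeroed, except for a checksum field in which
    # bits 3, 4 and 5 differ from the (made-up) expected value 0x1234.
    expected = 0x1234
    page = bytes(8) + (expected ^ 0b111000).to_bytes(2, "little") + bytes(8182)

    bits = flipped_bits(stored_checksum(page), expected)
    print("differing bits:", bits, "consecutive:", is_consecutive_run(bits))
```

A short run of adjacent differing bits, with the rest of the page looking
sane, is the pattern described in the quoted report; a stale checksum would
normally differ in many unrelated bit positions.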

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
