Re: Checksums by default?

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Checksums by default?
Date: 2017-02-20 16:50:26
Message-ID: 956ac362-00b3-839f-efd3-41bf3d3f64d9@2ndquadrant.com
Lists: pgsql-hackers

On 02/11/2017 01:38 AM, Tomas Vondra wrote:
>
> Incidentally, I've been dealing with a checksum failure reported by a
> customer last week, and based on the experience I tend to agree that we
> don't have the tools needed to deal with checksum failures. I think such
> tooling should be a 'must have' for enabling checksums by default.
>
> In this particular case the checksum failure is particularly annoying
> because it happens during recovery (on a standby, after a restart),
> during startup, so FATAL means shutdown.
>
> I've managed to inspect the page in a different way (dd and pageinspect
> from another instance), and it looks fine - no obvious data corruption,
> the only thing that seems borked is the checksum itself, and only three
> consecutive bits are flipped in the checksum. So this doesn't seem like
> a "stale checksum" - hardware issue is a possibility (the machine has
> ECC RAM though), but it might just as easily be a bug in PostgreSQL,
> when something scribbles over the checksum due to a buffer overflow,
> just before we write the buffer to the OS. So 'false failures' are not
> entirely impossible.
>

Not to leave this without any resolution: it seems the issue was caused
by a SAN. Some configuration changes were being made at the time of the
issue, and the SAN somehow ended up writing a page into a different
relfilenode and a different block. The page came from a btree index and
got written into a heap relfilenode, but it was otherwise intact - the
only thing that changed seems to be the block number, which explains
the minimal difference in the checksum.

I don't think we'll learn much more, but it seems the checksums did
their work in detecting the issue.
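(In case anyone wants to reproduce the "same page, different block
number" effect: the backend checksum code can be included directly into
a small frontend program, which is how pg_filedump does it too. A
minimal sketch follows - the fabricated page and the build command are
purely illustrative, and depending on the version you may need to link
against libpgcommon/libpgport.)

/*
 * checksum_blkno_demo.c - minimal sketch showing that the page checksum
 * mixes in the block number, so an otherwise identical page stored at a
 * different block verifies with a slightly different checksum.
 *
 * Build roughly like (paths illustrative):
 *   gcc -I$(pg_config --includedir-server) checksum_blkno_demo.c
 */
#include "postgres_fe.h"

#include "storage/bufpage.h"
#include "storage/checksum.h"
#include "storage/checksum_impl.h"	/* pulls in pg_checksum_page() */

int
main(void)
{
	/* the union forces the alignment pg_checksum_page() expects */
	union
	{
		char	data[BLCKSZ];
		double	align;
	}			buf;
	PageHeader	hdr = (PageHeader) buf.data;

	/* fabricate a minimally initialized (i.e. not "new") page */
	memset(buf.data, 0, BLCKSZ);
	hdr->pd_lower = SizeOfPageHeaderData;
	hdr->pd_upper = BLCKSZ;
	hdr->pd_special = BLCKSZ;
	PageSetPageSizeAndVersion(buf.data, BLCKSZ, PG_PAGE_LAYOUT_VERSION);

	printf("checksum as block 1000: 0x%04x\n",
		   pg_checksum_page(buf.data, 1000));
	printf("checksum as block 1001: 0x%04x\n",
		   pg_checksum_page(buf.data, 1001));

	return 0;
}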

>
> So I think we're not ready to enable checksums by default for everyone,
> not until we can provide tools to deal with failures like this (I don't
> think users will be amused if we tell them to use 'dd' and inspect the
> pages in a hex editor).
>
> ISTM the way forward is to keep the current default (disabled), but to
> allow enabling checksums on the fly. That will mostly fix the issue for
> people who actually want checksums but don't realize they need to enable
> them at initdb time (and starting from scratch is not an option for
> them), are running on good hardware and are capable of dealing with
> checksum errors if needed, even without more built-in tooling.
>
> Being able to disable checksums on the fly is nice, but it only really
> solves the issue of extra overhead - it does not really help with the
> failures (particularly when you can't even start the database, because
> of a checksum failure in the startup phase).
>
> So, shall we discuss what tooling would be useful / desirable?
>

Although the checksums did detect the issue (without them we might
never have noticed, or maybe the instance would have crashed
mysteriously), I still think better tooling is needed.

I've posted some minor pageinspect improvements I hacked together while
investigating this, but I don't think pageinspect is a very good tool
for investigating checksum / data corruption issues, for a number of
reasons:

1) It does not work at all when the instance does not even start - you
have to manually dump the pages and try inspecting them from another
instance.

2) Even then it assumes the pages are not corrupted, and may easily
cause segfaults or other issues if that's not the case.

3) It works on a manual, page-by-page basis.

4) It does not even try to resolve the issue.

For example, I think it'd be great to have a tool that works even on
instances that are not running - something that recursively walks
through all files in a data directory, verifies checksums on
everything, and lists/dumps pages with broken checksums for further
inspection. I have an alpha-alpha version of something along those
lines, written before the root cause was identified.
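To make that a bit more concrete, here's a rough sketch of the per-file
part of such a tool (not my actual code, just an illustration): it
scans one relation segment file block by block and reports blocks whose
stored checksum does not match the recomputed one. It assumes a cleanly
shut down cluster with checksums enabled and the first segment of a
relation (block numbers starting at 0); a real tool would walk the
whole data directory, skip non-relation files and add the segment
offset.

/*
 * verify_file_checksums.c - rough sketch, not a finished tool.
 *
 * Scans one relation segment file and reports blocks with bad checksums.
 * Assumptions: cluster shut down cleanly, checksums enabled, first
 * segment of the relation (block numbers start at 0).
 */
#include "postgres_fe.h"

#include "storage/bufpage.h"
#include "storage/checksum.h"
#include "storage/checksum_impl.h"

int
main(int argc, char **argv)
{
	union
	{
		char	data[BLCKSZ];
		double	align;
	}			buf;
	FILE	   *f;
	BlockNumber blkno = 0;
	long		failed = 0;

	if (argc != 2)
	{
		fprintf(stderr, "usage: %s RELATION_SEGMENT_FILE\n", argv[0]);
		return 1;
	}

	f = fopen(argv[1], "rb");
	if (!f)
	{
		fprintf(stderr, "could not open \"%s\"\n", argv[1]);
		return 1;
	}

	while (fread(buf.data, 1, BLCKSZ, f) == BLCKSZ)
	{
		PageHeader	hdr = (PageHeader) buf.data;

		/* all-zero (new) pages carry no checksum, skip them */
		if (!PageIsNew(buf.data))
		{
			uint16		csum = pg_checksum_page(buf.data, blkno);

			if (csum != hdr->pd_checksum)
			{
				printf("block %u: expected 0x%04x, found 0x%04x\n",
					   blkno, csum, hdr->pd_checksum);
				failed++;
			}
		}
		blkno++;
	}

	fclose(f);
	printf("%u blocks scanned, %ld bad checksums\n", blkno, failed);

	return (failed == 0) ? 0 : 1;
}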

It'd be nice to have something that could help with fixing the issues
(e.g. by fetching the last FPI from a backup), but that's clearly way
more difficult.

There are probably some other tools that might be useful when dealing
with data corruption (e.g. scrubbing to detect it).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
