9.3: summary of corruption detection / checksums / CRCs discussion

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: 9.3: summary of corruption detection / checksums / CRCs discussion
Date: 2012-04-21 21:40:57
Message-ID: 1335044457.25680.99.camel@jdavis
Lists: pgsql-hackers

A lot of discussion took place regarding corruption detection, and I am
attempting to summarize it in a useful way. Please excuse the lack of
references; I'm hoping to agree on the basic problem space and the
nature of the solutions offered, and then turn it into a wiki where we
can get into the details. Also, almost every discussion touched on
several of these issues.

Please help me fill in the parts of the problem and directions in the
high-level solution that are missing. This thread is not intended to get
into low-level design or decision making.

First, I'll get a few of the smaller issues out of the way:

* In addition to data pages and SLRU/CLOG, it also may be useful to
detect problems with temp files. It's also much easier, because we don't
need to worry about upgrade or crash scenarios. The performance impact
is unknown, but the check could easily be turned on/off at runtime.

* In addition to detecting random garbage, we also need to be able to
detect zeroing of pages. Right now, a zero page is not considered
corrupt, so that's a problem. We'll need to WAL-log table extension
operations, and we'll need to mitigate the performance impact of doing
so. I think we can do that by extending larger tables by many pages
(say, 16 at a time) so we can amortize the cost of WAL and avoid
contention.

* Utilities, like those for taking a base backup, should also verify the
checksum.

* In addition to detecting random garbage and zeros, we need to detect
entire pages being transposed into different parts of the same file or
different files. To do this we can include the database ID, tablespace,
relfilenode, and page number in the CRC calculation. Perhaps only
include relfilenode and page# to make it easier for utilities to check.
None of this information needs to actually be stored on the page, so it
doesn't affect the header. However, if we are going to expand the page
header anyway, it would be useful to include this information so that
the CRC can be calculated without any external context.
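
As a rough illustration of the transposition check in the last point,
here is a minimal Python sketch. It assumes a 4-byte checksum stored at
offset 0 of the page (a made-up layout, not the real page header) and
uses zlib's CRC-32 as a stand-in for whatever algorithm is eventually
chosen. The function names are hypothetical. The relfilenode and block
number are fed into the CRC but never stored on the page.

```python
import struct
import zlib

PAGE_SIZE = 8192       # PostgreSQL's default block size
CHECKSUM_OFFSET = 0    # hypothetical: where this sketch keeps the CRC
CHECKSUM_SIZE = 4

def page_crc(page: bytes, relfilenode: int, blkno: int) -> int:
    """CRC over the page's intended location plus its contents."""
    # Zero the checksum field so the stored value does not feed into itself.
    body = (page[:CHECKSUM_OFFSET] + b"\0" * CHECKSUM_SIZE
            + page[CHECKSUM_OFFSET + CHECKSUM_SIZE:])
    location = struct.pack("<II", relfilenode, blkno)
    # Running CRC: equivalent to zlib.crc32(location + body).
    return zlib.crc32(body, zlib.crc32(location))

def stamp_page(page: bytes, relfilenode: int, blkno: int) -> bytes:
    """Return a copy of the page with its checksum field filled in."""
    crc = page_crc(page, relfilenode, blkno)
    return (page[:CHECKSUM_OFFSET] + struct.pack("<I", crc)
            + page[CHECKSUM_OFFSET + CHECKSUM_SIZE:])

def verify_page(page: bytes, relfilenode: int, blkno: int) -> bool:
    """True if the page is intact and sits where it was written."""
    (stored,) = struct.unpack_from("<I", page, CHECKSUM_OFFSET)
    return stored == page_crc(page, relfilenode, blkno)

page = stamp_page(bytes(range(256)) * 32, relfilenode=16384, blkno=7)
print(verify_page(page, 16384, 7))   # same location
print(verify_page(page, 16384, 8))   # same bytes, wrong block
```

The first check passes and the second fails, even though the page
contents are byte-for-byte identical. A utility that knows only the
file name and offset can run the same check, which is the argument for
limiting the inputs to relfilenode and page number.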

Now, onto the two big problems, upgrade and torn pages:

-----------------------------------------------
UPGRADE (and on/off)
-----------------------------------------------

* Should we try to use existing space in header? It's very appealing to
be able to avoid the upgrade work by using existing space in the header.
There was a surprising amount of discussion about which space to use:
pd_pagesize_version or pd_tli. There was also some discussion of using a
few bits from various places.

* Table-level, or system level? Table-level would be appealing if there
turns out to be a significant performance impact. But there are
challenges during recovery, because no relcache is available. It seems
like a relatively minor problem, because pages would indicate whether
they have a checksum or not, but there are some details to be worked
out.

* If we do expand the header, we need an upgrade path. One proposed
approach is to start reserving the necessary space in the previous
version (with a simple point release), and have some way to verify that
all of the pages have the required free space to upgrade. Then, the new
version can update pages gradually, with some final VACUUM to ensure
that all pages are the new version. That sounds easy, except that we
need some way to free up space on the old pages in the old version,
which is non-trivial. For heap pages, that could be like an update; but
for index pages, it would need to be something like a page split, which
is specific to the index type.

* Also, if we expand the page header, we need to figure out something
for the SLRU/CLOG as well.

* We'll need some variant of VACUUM to turn checksums on/off (either
per-table or system wide).
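
To make the "verify that all of the pages have the required free space"
step concrete, here is a small Python sketch. pd_lower and pd_upper are
the existing header fields that bound a page's free space; the PageInfo
class, the function name, and the 8-byte figure are assumptions made up
for the example, not a proposal for the actual header size.

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class PageInfo:
    blkno: int
    pd_lower: int   # offset to the start of free space
    pd_upper: int   # offset to the end of free space

def pages_blocking_upgrade(pages: Iterable[PageInfo],
                           extra_header_bytes: int) -> List[int]:
    """Block numbers whose free space cannot absorb a larger header."""
    return [p.blkno for p in pages
            if p.pd_upper - p.pd_lower < extra_header_bytes]

pages = [
    PageInfo(blkno=0, pd_lower=24, pd_upper=8192),   # empty page
    PageInfo(blkno=1, pd_lower=400, pd_upper=404),   # nearly full
    PageInfo(blkno=2, pd_lower=500, pd_upper=508),   # exactly enough
]
print(pages_blocking_upgrade(pages, extra_header_bytes=8))
```

Only block 1 is reported. Those are the pages where the old version
would have to free up space first, which is the non-trivial part.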

-----------------------------------------------
TORN PAGES
-----------------------------------------------

We don't want torn pages to falsely indicate a checksum failure. Many
page writes are already protected from this with full-page images in the
WAL; but hint bit updates (including the index dead tuple markers) are
not.
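
A small Python simulation of the failure mode, under stated
assumptions: the storage writes 512-byte sectors atomically, the
checksum is a CRC-32 kept in the first 4 bytes of the page (a made-up
layout), and the crash happens after only the first sector reaches
disk. The helper names are hypothetical.

```python
import zlib

PAGE_SIZE = 8192     # PostgreSQL's default block size
SECTOR_SIZE = 512    # assumed atomic write unit of the device

def seal(body: bytes) -> bytes:
    """Prefix the page body with its CRC-32 (hypothetical layout)."""
    return zlib.crc32(body).to_bytes(4, "little") + body

def checksum_ok(page: bytes) -> bool:
    return int.from_bytes(page[:4], "little") == zlib.crc32(page[4:])

def torn_write(old: bytes, new: bytes, sectors_written: int) -> bytes:
    """On-disk result if a crash hits after only the first sectors land."""
    cut = sectors_written * SECTOR_SIZE
    return new[:cut] + old[cut:]

body = bytearray(b"\x2a" * (PAGE_SIZE - 4))
old_page = seal(bytes(body))
body[6000] |= 0x01                  # set one "hint bit" late in the page
new_page = seal(bytes(body))

torn = torn_write(old_page, new_page, sectors_written=1)
print(checksum_ok(old_page), checksum_ok(new_page), checksum_ok(torn))
# prints: True True False
```

Neither version of the page is corrupt, yet the mix of the new checksum
and the old hint bit fails verification. Without a full-page image in
the WAL there is nothing to repair the page from.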

* Just pay the price -- WAL-log all hint bit updates, including FPIs.

* Double-Write buffer -- this attacks the problem most directly. Don't
make any changes to the way hint bits are done; instead, push all page
writes through a double-write buffer. There are numerous performance
implications here, some of which may help and some of which may hurt.
It's hard to say, at the end, whether this will be
a good solution for everyone (particularly those without battery-backed
caches), but it seems like an accepted approach that can be very good
for the people who need performance the most.

* Bulk Load -- this is more indirect. The idea is that, during normal
OLTP operation, using the WAL for hints might not be so bad, because the
page is likely to need a FPI for some other reason. The worst case is
when bulk loading, so see if we can set hint bits during the bulk load
in an MVCC-safe way.
http://archives.postgresql.org/message-id/CABRT9RBRMdsoz8KxgeHfb4LG-ev9u67-6DLqvoiibpkKhTLQfw@mail.gmail.com

* Some way of caching CLOG information or making the access faster.
IIRC, there were some vague ideas about mmap()ing the CLOG, or caching
a very small representation of the CLOG.

* Something else -- There are a few other lines of thought here. For
instance, can we use WAL for hint bits without a FPI, and still protect
against torn pages causing CRC failures? This is related to a comment
during the 2011 developer meeting, where someone brought up the idea of
idempotent WAL changes, and how that could help us avoid FPIs. It seems
possible after reading the discussions, but I'm not clear enough on the
challenges to summarize them here.
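
For readers who have not seen the double-write idea before, here is a
toy in-memory Python model of it. This is only a sketch: the class and
method names are invented, the dictionaries stand in for files, and a
real implementation would have to fsync the double-write area before
touching the data file. The checksum layout (CRC-32 in the first 4
bytes) is likewise an assumption.

```python
import zlib

def seal(body: bytes) -> bytes:
    """Prefix the page body with its CRC-32 (hypothetical layout)."""
    return zlib.crc32(body).to_bytes(4, "little") + body

def is_intact(page: bytes) -> bool:
    return int.from_bytes(page[:4], "little") == zlib.crc32(page[4:])

class DoubleWriteStore:
    def __init__(self):
        self.data_file = {}   # blkno -> page as found in the data file
        self.dw_area = {}     # blkno -> full copy awaiting its final write

    def write_page(self, blkno, page, tear_at=None):
        # Step 1: put a full copy in the double-write area
        # (assumed durable before step 2 starts).
        self.dw_area[blkno] = page
        # Step 2: write in place. tear_at simulates a crash after that
        # many bytes; the block must already exist in that case.
        if tear_at is None:
            self.data_file[blkno] = page
            del self.dw_area[blkno]
        else:
            old = self.data_file[blkno]
            self.data_file[blkno] = page[:tear_at] + old[tear_at:]

    def recover(self):
        """Restore torn pages from the double-write area."""
        repaired = []
        for blkno, copy in self.dw_area.items():
            current = self.data_file.get(blkno)
            if is_intact(copy) and (current is None or not is_intact(current)):
                self.data_file[blkno] = copy
                repaired.append(blkno)
        self.dw_area.clear()
        return repaired

store = DoubleWriteStore()
store.write_page(3, seal(b"a" * 100))
hinted = seal(b"a" * 99 + b"b")          # same page with one byte changed
store.write_page(3, hinted, tear_at=50)  # crash halfway through the write
print(is_intact(store.data_file[3]))     # torn page fails its checksum
print(store.recover(), is_intact(store.data_file[3]))
```

The hint bit code path is untouched here. The cost moves to every page
write going out twice, which is where the performance questions above
come from.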

If we do use WAL for hint bit updates, that has an impact on Hot
Standby, because HS can't write WAL. So, it would seem that HS could not
set hint bits.

Comments?

Regards,
Jeff Davis
