Re: 16-bit page checksums for 9.2

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <simon(at)2ndQuadrant(dot)com>,<heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: <aidan(at)highrise(dot)ca>,<stark(at)mit(dot)edu>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2011-12-29 16:44:47
Lists: pgsql-hackers

> Heikki Linnakangas wrote:
> On 28.12.2011 01:39, Simon Riggs wrote:
>> On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas
>> wrote:
>>> On 25.12.2011 15:01, Kevin Grittner wrote:
>>>> I don't believe that. Double-writing is a technique to avoid
>>>> torn pages, but it requires a checksum to work. This chicken-
>>>> and-egg problem requires the checksum to be implemented first.
>>> I don't think double-writes require checksums on the data pages
>>> themselves, just on the copies in the double-write buffers. In
>>> the double-write buffer, you'll need some extra information per-
>>> page anyway, like a relfilenode and block number that indicates
>>> which page it is in the buffer.

You are clearly right -- if there is no checksum in the page itself,
you can put one in the double-write metadata. I've never seen that
discussed before, but I'm embarrassed that it never occurred to me.
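To make that concrete, here is a rough sketch (not PostgreSQL code; the
entry layout and names are invented for illustration) of a double-write
buffer entry that carries the relfilenode, block number, and a CRC of the
page image in its own metadata, so the data page itself needs no checksum:

```python
# Hypothetical double-write buffer entry: the checksum lives in the
# entry's metadata, not in the page. CRC32 stands in for whatever
# checksum algorithm would actually be chosen.
import struct
import zlib

PAGE_SIZE = 8192  # PostgreSQL's default block size

# relfilenode, block number, crc32 of the page image
HEADER = struct.Struct("<III")

def pack_entry(relfilenode: int, blocknum: int, page: bytes) -> bytes:
    assert len(page) == PAGE_SIZE
    crc = zlib.crc32(page)
    return HEADER.pack(relfilenode, blocknum, crc) + page

def unpack_entry(entry: bytes):
    """Return (relfilenode, blocknum, page), or None if the entry is torn."""
    relfilenode, blocknum, crc = HEADER.unpack_from(entry)
    page = entry[HEADER.size:HEADER.size + PAGE_SIZE]
    if zlib.crc32(page) != crc:
        return None  # partial/torn write: skip this entry during recovery
    return relfilenode, blocknum, page
```

A torn write to the buffer shows up as a checksum mismatch in the entry
metadata, which is all recovery needs to decide whether to trust it.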

>> How would you know when to look in the double write buffer?
> You scan the double-write buffer, and every page in the double
> write buffer that has a valid checksum, you copy to the main
> storage. There's no need to check validity of pages in the main
> storage.

Right. I'll recap my understanding of double-write (from memory --
if there's a material error or omission, I hope someone will correct me.)

The write-ups I've seen on double-write techniques have all writes go
first to the double-write buffer (a single, sequential file that stays
around). Because this is sequential writing to a file which is
overwritten pretty frequently, the writes reach the controller very
fast, and a BBU write-back cache is unlikely to actually write to disk
very often. On good server-quality hardware, it should be blasting
RAM-to-RAM very efficiently. The file is fsync'd (like I said,
hopefully to BBU cache), then each page in the double-write buffer is
written to the normal page location, and that is fsync'd. Once that
is done, the database writes have no risk of being torn, and the
double-write buffer is marked as empty. This all happens at the
point when you would be writing the page to the database, after the
WAL covering those changes has already been flushed.
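The write-side sequence above could be sketched like this (a toy model,
not PostgreSQL code; file layout, names, and the use of CRC32 are all
assumptions for illustration):

```python
# Sketch of the double-write write path: (1) write all page images,
# each with a checksum, to the double-write file and fsync it;
# (2) write each page to its normal location and fsync; (3) only
# then mark the double-write file empty. A crash during (2) can
# tear a data page, but recovery restores it from the buffer file.
import os
import struct
import zlib

def flush_pages(dw_path, pages):
    """pages: list of (data_path, byte_offset, page_bytes) tuples."""
    # Phase 1: sequential write to the double-write file, then fsync.
    with open(dw_path, "wb") as dw:
        for _data_path, _offset, page in pages:
            dw.write(struct.pack("<I", zlib.crc32(page)) + page)
        dw.flush()
        os.fsync(dw.fileno())
    # Phase 2: write pages in place, then fsync each data file.
    for data_path, offset, page in pages:
        with open(data_path, "r+b") as f:
            f.seek(offset)
            f.write(page)
            f.flush()
            os.fsync(f.fileno())
    # Phase 3: safe to mark the double-write file empty now.
    with open(dw_path, "wb") as dw:
        os.fsync(dw.fileno())
```

The ordering is the whole point: the buffer fsync must complete before
any in-place write begins, or recovery can't trust the buffer contents.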

On crash recovery you read through the double-write buffer from the
start and write the pages which look good (including a good checksum)
to the database before replaying WAL. If you find a checksum error
in processing the double-write buffer, you assume that you never got
as far as the fsync of the double-write buffer, which means you never
started writing the buffer contents to the database, which means
there can't be any torn pages there. If you get to the end and
fsync, you can be sure any torn pages from a previous attempt to
write to the database itself have been overwritten with the good copy
in the double-write buffer. Either way, you move on to WAL replay.
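The recovery scan could be sketched as follows (again a toy model with
an invented entry layout of crc, target byte offset, and page image; in
this simplified form all pages live in one data file):

```python
# Sketch of double-write recovery: scan the buffer file, and for each
# entry whose checksum verifies, copy the page back into place. On the
# first bad checksum we stop: per the reasoning above, the buffer fsync
# never completed, so no data page was touched by that batch.
import os
import struct
import zlib

PAGE_SIZE = 8192
HDR = struct.Struct("<IQ")  # crc32 of page, byte offset in data file

def recover(dw_path, data_path):
    with open(dw_path, "rb") as dw:
        buf = dw.read()
    entry = HDR.size + PAGE_SIZE
    with open(data_path, "r+b") as f:
        for pos in range(0, len(buf) - entry + 1, entry):
            crc, offset = HDR.unpack_from(buf, pos)
            page = buf[pos + HDR.size:pos + entry]
            if zlib.crc32(page) != crc:
                # Torn buffer entry: writing to the database never
                # started, so there is nothing to repair.
                break
            f.seek(offset)
            f.write(page)  # overwrite any possibly-torn data page
        f.flush()
        os.fsync(f.fileno())
```

Only after this scan completes does WAL replay begin, against a database
known to contain no torn pages.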

You wind up with a database free of torn pages before you apply WAL.
full_page_writes to the WAL are not needed as long as double-write is
used for any pages which would have been written to the WAL. If
checksums were written to the double-write buffer metadata instead of
being added to the page itself, this could be implemented on its own. It
would probably allow a modest speed improvement over using
full_page_writes and would eliminate those full-page images from the
WAL files, making them smaller.

If we do add a checksum to the page header, that could be used for
testing for torn pages in the double-write buffer without needing a
redundant calculation for double-write. With no torn pages in the
actual database, checksum failures there would never be false
positives. To get this right for a checksum in the page header,
double-write would need to be used for all cases where
full_page_writes now are used (i.e., the first write of a page after
a checkpoint), and for all unlogged writes (e.g., hint-bit-only
writes). There would be no correctness problem for always using
double-write, but it would be unnecessary overhead for other page
writes, which I think we can avoid.
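The decision about which writes need double-write might boil down to
something like this (a hypothetical helper; the names and the LSN
comparison are my reading of the paragraph above, not existing code):

```python
# Hypothetical predicate: double-write is needed where full_page_writes
# would apply today (first write of a page since the last checkpoint)
# and for unlogged changes such as hint-bit-only writes; other writes
# can skip the extra copy.
def needs_double_write(page_lsn: int,
                       checkpoint_redo_lsn: int,
                       wal_logged: bool) -> bool:
    # Page not modified since the checkpoint's redo point: this is
    # its first write since the checkpoint.
    first_write_since_checkpoint = page_lsn < checkpoint_redo_lsn
    unlogged_change = not wal_logged  # e.g. a hint-bit-only update
    return first_write_since_checkpoint or unlogged_change
```

All other page writes are already protected by WAL replay plus the
checkpoint's full-page guarantee, so skipping double-write for them is
safe, just as skipping full-page images for them is safe today.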


