Re: Enabling Checksums

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Enabling Checksums
Date: 2012-11-15 02:22:41
Message-ID: CA+TgmoZWLYnGxDqJ1t5KZpOeO4yDO=99osTxGU0iaA9QP5mu=g@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 14, 2012 at 6:24 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>> Hmm... what if we took this a step further and actually stored the
>> checksums in a separate relation fork? That would make it pretty
>> simple to support enabling/disabling checksums for particular
>> relations. It would also allow us to have a wider checksum, like 32
>> or 64 bits rather than 16. I'm not scoffing at a 16-bit checksum,
>> because even that's enough to catch a very high percentage of errors,
>> but it wouldn't be terrible to be able to support a wider one, either.
>
> I don't remember exactly why this idea was sidelined before, but I don't
> think there were any showstoppers. It does have some desirable
> properties; most notably the ability to add checksums without a huge
> effort, so perhaps the idea can be revived.
>
> But there are some practical issues, as Tom points out. Another one is
> that it's harder for external utilities (like pg_basebackup) to verify
> checksums.
>
> And I just had another thought: these pages of checksums would be data
> pages, with an LSN. But as you clean ordinary data pages, you need to
> constantly bump the LSN of the very same checksum page (because it
> represents 1000 ordinary data pages), making it harder to actually clean
> the checksum page and finish a checkpoint. Is this a practical concern
> or am I borrowing trouble?

Well, I think the invariant we'd need to maintain is as follows: every
page for which the checksum fork might be wrong must have an FPI
following the redo pointer. So, at the time we advance the redo
pointer, we need the checksum fork to be up-to-date for all pages for
which a WAL record was written after the old redo pointer except for
those for which a WAL record has again been written after the new redo
pointer. In other words, the checksum pages we write out don't need
to be completely accurate; the checksums for any blocks we know will
get clobbered anyway during replay don't really matter.
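
That rule can be modeled as a small predicate; this is a hypothetical sketch with plain integers standing in for LSNs (none of these names exist in PostgreSQL):

```python
def checksum_must_be_current(last_wal_lsn, old_redo, new_redo):
    """Must the checksum-fork entry for a block already be accurate
    at the moment the redo pointer advances from old_redo to new_redo?

    last_wal_lsn: LSN of the most recent WAL record touching the block,
    or None if the block was not logged at all in this window.
    """
    if last_wal_lsn is None or last_wal_lsn <= old_redo:
        return False   # block untouched since the old redo pointer
    if last_wal_lsn > new_redo:
        return False   # a fresh FPI after the new redo pointer will
                       # clobber the block during replay anyway
    return True        # modified in the window and not re-covered
```

For example, a block last logged at LSN 150 with the redo pointer advancing from 100 to 200 needs an accurate checksum, while one re-logged at 250 does not.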

However, reading your comments, I do see one sticking point. If we
don't update the checksum page until a buffer is written out, which of
course makes a lot of sense, then during a checkpoint, we'd have to
flush all of the regular pages first and then all the checksum pages
afterward. Otherwise, the checksum pages wouldn't be sufficiently
up-to-date at the time we write them. There's no way to make that
happen just by fiddling with the LSN; rather, we'd need some kind of
two-pass algorithm over the buffer pool. That doesn't seem
unmanageable, but it's more complicated than what we do now.
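
The two-pass flush might look roughly like this; every name here is hypothetical, and PostgreSQL's real buffer manager is of course far more involved (pins, locks, the background writer, and so on):

```python
class Buffer:
    """Minimal stand-in for a shared buffer (hypothetical)."""
    def __init__(self, block_no, is_checksum_page, dirty=True):
        self.block_no = block_no
        self.is_checksum_page = is_checksum_page
        self.dirty = dirty

def checkpoint_flush(buffer_pool, write_fn, update_checksum_fn):
    """Two-pass checkpoint sketch.

    Pass 1 flushes dirty data pages, updating each page's entry in the
    checksum fork as it is written (which dirties a checksum page).
    Pass 2 then flushes the checksum pages, which are only now
    guaranteed to be up to date.
    """
    for buf in buffer_pool:                       # pass 1: data pages
        if buf.dirty and not buf.is_checksum_page:
            update_checksum_fn(buf)
            write_fn(buf)
            buf.dirty = False
    for buf in buffer_pool:                       # pass 2: checksum pages
        if buf.dirty and buf.is_checksum_page:
            write_fn(buf)
            buf.dirty = False
```

The ordering is the whole point: a single pass in arbitrary order could write a checksum page before some data page it covers, leaving a stale checksum on disk at the checkpoint.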

I'm not sure we'd actually bother setting the LSN on the checksum
pages, because the action that prompts an update of a checksum page is
the decision to write out a non-checksum page, and that's not a
WAL-loggable action, so there's no obvious LSN to apply, and no
obvious need to apply one at all.

I'm also not quite sure what happens with full_page_writes=off. I
don't really see how to make this scheme work at all in that
environment. Keeping the checksum in the page seems to dodge quite a
few problems in that case ... as long as you assume that 8kB writes
really are atomic.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
