Re: 16-bit page checksums for 9.2

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-16 11:16:31
Message-ID: CA+U5nMJ-Sc5rrSQqu5ed1sbUbM8Zr-BcZYwhqazSB3q99aFOmw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 9, 2012 at 3:16 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Wed, Feb 8, 2012 at 2:05 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>> On Wed, Feb 08, 2012 at 11:40:34AM +0000, Simon Riggs wrote:
>>> On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>>> > On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote:
>>> >> On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>>> >> > This patch uses FPIs to guard against torn hint writes, even when the
>>> >> > checksums are disabled. ?One could not simply condition them on the
>>> >> > page_checksums setting, though. ?Suppose page_checksums=off and we're hinting
>>> >> > a page previously written with page_checksums=on. ?If that write tears,
>>> >> > leaving the checksum intact, that block will now fail verification. ?A couple
>>> >> > of ideas come to mind. ?(a) Read the on-disk page and emit an FPI only if the
>>> >> > old page had a checksum. ?(b) Add a checksumEnableLSN field to pg_control.
>>> >> > Whenever the cluster starts with checksums disabled, set the field to
>>> >> > InvalidXLogRecPtr. ?Whenever the cluster starts with checksums enabled and the
>>> >> > field is InvalidXLogRecPtr, set the field to the next LSN. ?When a checksum
>>> >> > failure occurs in a page with LSN older than the stored one, emit either a
>>> >> > softer warning or no message at all.
>>> >>
>>> >> We can only change page_checksums at restart (now) so the above seems moot.
>>> >>
>>> >> If we crash with FPWs enabled we repair any torn pages.
>>> >
>>> > There's no live bug, but that comes at a high cost: the patch has us emit
>>> > full-page images for hint bit writes regardless of the page_checksums setting.
>>>
>>> Sorry, I don't understand what you mean. I don't see any failure cases
>>> that require that.
>>>
>>> page_checksums can only change at a shutdown checkpoint,
>>>
>>> The statement "If that write tears,
>>> >> > leaving the checksum intact, that block will now fail verification."
>>> cannot happen, ISTM.
>>>
>>> If we write out a block we update the checksum if page_checksums is
>>> set, or we clear it. If we experience a torn page at crash, the FPI
>>> corrects that, so the checksum never does fail verification. We only
>>> need to write a FPI when we write with checksums.
>>>
>>> If that's wrong, please explain a failure case in detail.
>>
>> In the following description, I will treat a heap page as having two facts.
>> The first is either CHKSUM or !CHKSUM, and the second is either HINT or !HINT.
>> A torn page write lacking the protection of a full-page image can update one
>> or both facts despite the transaction having logically updated both.  (This is
>> just the normal torn-page scenario.)  Given that, consider the following
>> sequence of events:
>>
>> 1. startup with page_checksums = on
>> 2. update page P with facts CHKSUM, !HINT
>> 3. clean write of P
>> 4. clean shutdown
>>
>> 5. startup with page_checksums = off
>> 6. update P with facts !CHKSUM, HINT.  no WAL since page_checksums=off
>> 7. begin write of P
>> 8. crash.  torn write of P only updates HINT.  disk state now CHKSUM, HINT
>>
>> 9. startup with page_checksums = off
>> 10. crash recovery.  does not touch P, because step 6 generated no WAL
>> 11. clean shutdown
>>
>> 12. startup with page_checksums = on
>> 13. read P.  checksum failure
>>
>> Again, this cannot happen with checksum16_with_wallogged_hint_bits.v7.patch,
>> because step 6 _does_ write WAL.  That ought to change for the sake of
>> page_checksums=off performance, at which point we must protect the above
>> scenario by other means.
>
> Thanks for explaining.
>
> Logic fixed.
>
>>> >> > PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
>>> >> > is insufficient.
>>> >>
>>> >> Am serialising this by only writing PageLSN while holding buf hdr lock.
>>> >
>>> > That means also taking the buffer header spinlock in every PageGetLSN() caller
>>> > holding only a shared buffer content lock. ?Do you think that will pay off,
>>> > versus the settled pattern of trading here your shared buffer content lock for
>>> > an exclusive one?
>>>
>>> Yes, I think it will pay off. This is the only code location where we
>>> do that, and we are already taking the buffer header lock, so there is
>>> effectively no overhead.
>>
>> The sites I had in the mind are the calls to PageGetLSN() in FlushBuffer()
>> (via BufferGetLSN()) and XLogCheckBuffer().  We don't take the buffer header
>> spinlock at either of those locations.  If they remain safe under the new
>> rules, we'll need comments explaining why.  I think they may indeed be safe,
>> but it's far from obvious.
>
> OK, some slight rework required to do that, no problems.
>
> I checked all other call sites and have updated README to explain.

v8 attached

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
checksum16_with_wallogged_hint_bits.v8c.patch text/x-diff 69.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2012-02-16 11:31:04 Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)
Previous Message Fujii Masao 2012-02-16 09:15:42 Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)