Re: Page Checksums + Double Writes

From: Jignesh Shah <jkshah(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>, David Fetter <david(at)fetter(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-22 18:50:23
Message-ID: CAGvK12VvJ95WnMQEOZHC4MhZv7kCn6OtcmOwCzVV069J2wx6dg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 22, 2011 at 11:16 AM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Jignesh Shah <jkshah(at)gmail(dot)com> wrote:
>
>> When we use Doublewrite with checksums, we can safely disable
>> full_page_writes causing a HUGE reduction to the WAL traffic
>> without loss of reliability due to a write fault since there are
>> two writes always. (Implementation detail discussable).
>
> The "always" there surprised me.  It seemed to me that we only need
> to do the double-write where we currently do full page writes or
> unlogged writes.  In thinking about your message, it finally struck

Currently PG only does a full page write for the first change that
dirties a page after a checkpoint. This scheme works because all later
changes are relative to that first page image, so if a checkpoint
write fails, the page can be recreated from the full page write plus
all the delta changes in WAL.
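
A toy sketch of that recovery path, assuming a page is a byte array
and a WAL delta is a hypothetical (offset, bytes) pair. None of these
names come from the PostgreSQL source; this only illustrates "full
page image + deltas":

```python
PAGE_SIZE = 8192  # assumed page size for this illustration

def apply_delta(page: bytearray, offset: int, data: bytes) -> None:
    """Apply one hypothetical WAL delta: overwrite bytes at an offset."""
    page[offset:offset + len(data)] = data

def recover_page(full_page_image: bytes, deltas) -> bytes:
    """Rebuild a page from its full page image plus the later deltas."""
    page = bytearray(full_page_image)
    for offset, data in deltas:
        apply_delta(page, offset, data)
    return bytes(page)

# Image logged with the first change after the checkpoint.
fpw_image = bytes(PAGE_SIZE)
# Later changes are logged only as small deltas against that image.
wal_deltas = [(0, b"row1"), (100, b"row2"), (0, b"ROW1")]

# If the checkpoint write tore the on-disk page, the on-disk copy is
# ignored and the page is rebuilt purely from WAL.
recovered = recover_page(fpw_image, wal_deltas)
```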

In the double write implementation, every checkpoint write is written
twice. If the first write (to the doublewrite area) fails, then the
original page is not corrupted, and if the second write (to the actual
data page) fails, then one can recover it from the earlier write. It
may seem that this means 2X writes during a checkpoint, and that is
true. But I can argue that there are the same 2X writes right now,
except that 1X of the writes goes to WAL DURING TRANSACTION COMMIT.
Also, since the doublewrite area is generally written in its own file,
it is essentially sequential, so it doesn't have the same write
latencies as the actual checkpoint write. So if you look at the net
amount of writes, it is the same. For unlogged tables, doing the
doublewrite is not much of a penalty, even though those pages were not
being logged in WAL before. Doing the double write for them is still
safe and gives those tables resilience, even though it is not
required. The net result is that the underlying page is never
"irrecoverable" due to failed writes.
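
The two failure cases can be sketched in a small in-memory model.
Everything here is an assumption made for illustration: the CRC32
prefix as the page checksum, the dicts standing in for the doublewrite
file and the data file, and the fail_at switch that simulates a torn
write. It is not the proposed implementation.

```python
import zlib

PAGE_SIZE = 8192
CRC_LEN = 4

def make_page(fill: bytes) -> bytes:
    """Build a page: CRC32 of the body, followed by the body."""
    body = fill * (PAGE_SIZE - CRC_LEN)
    return zlib.crc32(body).to_bytes(CRC_LEN, "big") + body

def is_valid(page: bytes) -> bool:
    """A page is valid when its stored CRC matches its body."""
    stored = int.from_bytes(page[:CRC_LEN], "big")
    return stored == zlib.crc32(page[CRC_LEN:])

def torn(old: bytes, new: bytes) -> bytes:
    """Simulate a torn write: only the first half of the new page lands."""
    half = PAGE_SIZE // 2
    return new[:half] + old[half:]

def checkpoint_write(dw, data, page_no, new_page, fail_at=None):
    """Write the page to the doublewrite area first, then to the data file."""
    if fail_at == "doublewrite":
        old_dw = dw.get(page_no, bytes(PAGE_SIZE))
        dw[page_no] = torn(old_dw, new_page)
        return  # crash: the data page was never touched
    dw[page_no] = new_page
    if fail_at == "data":
        data[page_no] = torn(data[page_no], new_page)
        return  # crash: the doublewrite copy is intact
    data[page_no] = new_page

def recover(dw, data, page_no):
    """Repair a torn data page from a valid doublewrite copy, if any."""
    if not is_valid(data[page_no]):
        copy = dw.get(page_no)
        if copy is not None and is_valid(copy):
            data[page_no] = copy
    return data[page_no]
```

With fail_at="data" the torn data page fails its checksum and is
restored from the doublewrite copy. With fail_at="doublewrite" the
data page still holds the old, valid contents, so nothing needs
repair.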

> me that this might require a WAL record to be written with the
> checksum (or CRC; whatever we use).  Still, writing a WAL record
> with a CRC prior to the page write would be less data than the full
> page.  Doing double-writes instead for situations without the torn
> page risk seems likely to be a net performance loss, although I have
> no benchmarks to back that up (not having a double-write
> implementation to test).  And if we can get correct behavior without
> doing either (the checksum WAL record or the double-write), that's
> got to be a clear win.

I am not sure why one would want to write the checksum to WAL.
As for the double writes, in fact there is not a net loss because
(a) the writes to the doublewrite area are sequential, so the write
calls are relatively very fast and in fact do not cause any latency
increase to any transactions, unlike full_page_writes.
(b) It can be moved to a different location to put no stress on the
default tablespace if you are worried about that spindle handling 2X
writes, just as the full_page_writes load is mitigated if you move
pg_xlog to a different spindle.

My own tests support that the net result is almost as fast as
full_page_writes=off, though not the same due to the extra write
(which gives you the desired reliability), and way better than
full_page_writes=on.
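
The write-volume side of that comparison can be put as a
back-of-envelope count. This is a simplified model with hypothetical
inputs, not a benchmark: it assumes each dirty page gets exactly one
full page image per checkpoint cycle, and it ignores latency and the
sequential vs. random difference entirely.

```python
PAGE_SIZE = 8192  # assumed page size for this illustration

def bytes_written(dirty_pages: int, delta_bytes: int, mode: str):
    """Return (wal, doublewrite, data) bytes for one checkpoint cycle.

    mode is a hypothetical label: "fpw_on", "fpw_off" or "doublewrite".
    """
    data = dirty_pages * PAGE_SIZE          # the checkpoint write itself
    if mode == "fpw_on":
        # Full page images travel through WAL at transaction time.
        return (delta_bytes + dirty_pages * PAGE_SIZE, 0, data)
    if mode == "fpw_off":
        # Only deltas in WAL; no protection against torn pages.
        return (delta_bytes, 0, data)
    if mode == "doublewrite":
        # Only deltas in WAL; the extra page copy moves to checkpoint time.
        return (delta_bytes, dirty_pages * PAGE_SIZE, data)
    raise ValueError(f"unknown mode: {mode}")

for mode in ("fpw_on", "fpw_off", "doublewrite"):
    wal, dw, data = bytes_written(1000, 50_000, mode)
    print(mode, "wal:", wal, "doublewrite:", dw, "data:", data)
```

Under these assumptions the total bytes are the same for fpw_on and
doublewrite. What changes is where the extra page copy goes: into WAL
in the transaction path, or into the doublewrite file at checkpoint
time.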

Regards,
Jignesh

> -Kevin
