Re: 16-bit page checksums for 9.2

From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: simon(at)2ndquadrant(dot)com, heikki(dot)linnakangas(at)enterprisedb(dot)com, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2011-12-30 14:44:09
Message-ID: CAC_2qU-OnB4Zpcs77q7Xo4L+vBOhFc-RKS6WJNWFv+7m8jzoNw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 29, 2011 at 11:44 AM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

> You wind up with a database free of torn pages before you apply WAL.
> full_page_writes to the WAL are not needed as long as double-write is
> used for any pages which would have been written to the WAL.  If
> checksums were written to the double-buffer metadata instead of
> adding them to the page itself, this could be implemented alone.  It
> would probably allow a modest speed improvement over using
> full_page_writes and would eliminate those full-page images from the
> WAL files, making them smaller.

Correct. So now lots of people seem to be jumping on the double-write
bandwagon and looking at some of the things it promises: all writes
are durable.

This solves 2 big issues:
- Remove torn-page problem
- Remove FPW from WAL

That up front looks pretty attractive. But we need to look at the
tradeoffs, and then decide (benchmark, anyone?).

Remember, postgresql is a double-write system right now. The 1st,
checksummed write is the FPW in WAL. It's fsynced. And the 2nd synced
write is when the data file is synced during checkpoint.

So, postgresql currently has an optimization: not every write has
*requirements* for atomic, instant durability. And so postgresql gets
to do lots of writes to the OS cache and *not* request them to be
instantly synced. And then at some point, when it's ready to clear
the 1st checksummed write, it makes sure every write is synced. And
lots of work went into PG recently to get even better at the
collection of writes/syncs that happen at checkpoint time, to take
even bigger advantage of the fact that it's better to write everything
in a file first, then call a single sync.
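
Roughly, that "write many, sync once" pattern looks like this (a
hypothetical sketch using raw POSIX calls, not the actual
smgr/bgwriter/checkpointer code; error handling mostly omitted):

/* Sketch of the current pattern: buffered writes during normal
 * running, one sync per file at checkpoint.  Hypothetical example. */
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Normal running: hand a dirty buffer to the OS cache, no fsync. */
static void write_dirty_buffer(int datafd, const char *page, off_t off)
{
    if (pwrite(datafd, page, BLCKSZ, off) != BLCKSZ)
        { /* error handling omitted in this sketch */ }
}

/* Checkpoint: a single fsync per file covers all the writes above. */
static void checkpoint_sync_file(int datafd)
{
    if (fsync(datafd) != 0)
        { /* error handling omitted in this sketch */ }
}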

So moving to this new double-write-area bandwagon, we move from a "WAL
FPW synced at the commit, collect as many other writes, then final
sync" type system to a system where *EVERY* buffer write-out requires
syncs of 2 separate 8K writes. So we avoid the FPW at commit (yes,
that's nice for latency), and we guarantee every buffer written is
consistent (that fixes our hint-bit-only dirty writes from being
torn). And we do that at a cost of every buffer write requiring 2
fsyncs, in a serial fashion. Come checkpoint, I'm wondering....
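
Per buffer, that sequence would look roughly like this (again a
hypothetical sketch with raw POSIX calls; the dwfd/datafd descriptors
and offsets are assumptions, not an existing API):

/* Sketch of one buffer write-out under a double-write scheme:
 * two serial synced writes per buffer.  Hypothetical illustration. */
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

static int dw_write_buffer(int dwfd, int datafd, const char *page,
                           off_t dw_off, off_t data_off)
{
    /* 1st synced write: page image into the double-write area. */
    if (pwrite(dwfd, page, BLCKSZ, dw_off) != BLCKSZ || fsync(dwfd) != 0)
        return -1;

    /* 2nd synced write: same page into its real location.  Only safe
     * once the first sync has completed, hence the 2 serial fsyncs. */
    if (pwrite(datafd, page, BLCKSZ, data_off) != BLCKSZ || fsync(datafd) != 0)
        return -1;

    return 0;
}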

Again, all that to avoid a single "optimization" that postgresql currently has:
1) writes for hint-bit-only buffers don't need to be durable

And the problem that optimization introduces:
1) Since they aren't guaranteed durable, we can't believe a checksum
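
To spell it out: a hint-bit-only dirty page is written with no FPW
behind it, so if that write tears, the half carrying the updated
checksum can land on disk while the half carrying the hint bit
doesn't, and the checksum then fails verification even though nothing
logged was lost. A toy illustration (the additive 16-bit sum here is
made up for the example, not the checksum algorithm under discussion):

/* Toy illustration: a torn write of a hint-bit-only change makes the
 * stored page checksum fail.  Not PostgreSQL's page layout or sum. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

/* Sum everything except the first 2 bytes, where the checksum lives. */
static uint16_t page_sum(const uint8_t *page)
{
    uint32_t sum = 0;
    for (int i = 2; i < BLCKSZ; i++)
        sum += page[i];
    return (uint16_t) sum;
}

int main(void)
{
    uint8_t old_page[BLCKSZ] = {0};   /* previous on-disk image */
    uint8_t new_page[BLCKSZ] = {0};   /* same page, hint bit set */
    uint8_t on_disk[BLCKSZ];

    new_page[BLCKSZ - 100] |= 0x01;   /* hint bit in the 2nd half */
    uint16_t csum = page_sum(new_page);
    memcpy(new_page, &csum, 2);       /* checksum stored in 1st half */

    /* Torn write: only the first 4K of the new image reaches disk. */
    memcpy(on_disk, new_page, BLCKSZ / 2);
    memcpy(on_disk + BLCKSZ / 2, old_page + BLCKSZ / 2, BLCKSZ / 2);

    uint16_t stored;
    memcpy(&stored, on_disk, 2);
    printf("stored %u, recomputed %u -> %s\n",
           (unsigned) stored, (unsigned) page_sum(on_disk),
           stored == page_sum(on_disk) ? "ok" : "checksum failure");
    return 0;
}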

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
