Re: Page Checksums + Double Writes

From: Jignesh Shah <jkshah(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>, David Fetter <david(at)fetter(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-22 20:23:23
Message-ID: CAGvK12ULTkYVs_6OXMv-5EH3APXxC74R-w-17tmiVu9MyN2j+g@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 22, 2011 at 3:04 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Dec 22, 2011 at 1:50 PM, Jignesh Shah <jkshah(at)gmail(dot)com> wrote:
>> In the double write implementation, every checkpoint write is double
>> written,
>
> Unless I'm quite thoroughly confused, which is possible, the double
> write will need to happen the first time a buffer is written following
> each checkpoint.  Which might mean the next checkpoint, but it could
> also be sooner if the background writer kicks in, or in the worst case
> a buffer has to do its own write.
>

Logically, the double write happens for every checkpoint write and it
gets fsynced. Implementation-wise you can handle a chunk of those pages
at a time, as we do with sets of pages, and sync them once, and it
still performs better than full_page_writes. As long as you compare
with full_page_writes=on, the scheme is always much better. If you
compare it with the performance of full_page_writes=off it is slightly
slower, but then you lose the reliability. So performance testers like
me, who always turn off full_page_writes during benchmark runs anyway,
will not see any impact. However, folks in production who are rightly
scared to turn off full_page_writes will have the ability to increase
performance without being scared of failed writes.
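
To make that concrete, here is a rough sketch of the flow I have in
mind (plain C; the name double_write_batch and the file descriptors
are mine for illustration only, not from any actual patch):

    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /*
     * Illustrative sketch only: durably flush a batch of dirty pages.
     * dw_fd is the double-write area, data_fd the real data file.
     */
    static void
    double_write_batch(int dw_fd, int data_fd,
                       char pages[][BLCKSZ], off_t offsets[], int npages)
    {
        int     i;

        /* 1. Copy every page of the batch into the double-write area. */
        for (i = 0; i < npages; i++)
            pwrite(dw_fd, pages[i], BLCKSZ, (off_t) i * BLCKSZ);

        /* 2. One fsync covers the whole batch; after it completes, a
         *    torn write to the data file can be repaired from here. */
        fsync(dw_fd);

        /* 3. Only now write the pages to their real locations. */
        for (i = 0; i < npages; i++)
            pwrite(data_fd, pages[i], BLCKSZ, offsets[i]);
        fsync(data_fd);
    }

On crash recovery you rewrite any intact pages found in the
double-write area to their real locations, which is what makes it safe
to run without full-page images in WAL.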

> Furthermore, we can't *actually* write any pages until they are
> written *and fsync'd* to the double-write buffer.  So the penalty for
> the background writer failing to do the right thing is going to go up
> enormously.  Think about VACUUM or COPY IN, using a ring buffer and
> kicking out its own pages.  Every time it evicts a page, it is going
> to have to doublewrite the buffer, fsync it, and then write it for
> real.  That is going to make PostgreSQL 6.5 look like a speed demon.

Like I said, implementation-wise it depends on how many such pages you
sync simultaneously, and the real tests show that it is actually much
faster than one would expect.
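
With the hypothetical double_write_batch() sketch above, the cost
Robert is worried about is really the number of fsyncs per page: a
backend evicting a single buffer pays the whole round trip itself,
while the checkpointer amortizes the same two fsyncs over the batch.

    /* dw_fd and data_fd as in the earlier sketch; values illustrative. */
    char   victim[BLCKSZ];          /* one buffer evicted by a backend */
    off_t  victim_offset = 0;
    char   batch[64][BLCKSZ];       /* a checkpointer/bgwriter batch   */
    off_t  batch_offsets[64] = {0};

    /* Backend eviction: two fsyncs for a single page. */
    double_write_batch(dw_fd, data_fd, &victim, &victim_offset, 1);

    /* Checkpointer: the same two fsyncs spread across 64 pages, so
     * the per-page cost is far lower. */
    double_write_batch(dw_fd, data_fd, batch, batch_offsets, 64);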

> The background writer or checkpointer can conceivably dump a bunch of
> pages into the doublewrite area and then fsync the whole thing in
> bulk, but a backend that needs to evict a page only wants one page, so
> it's pretty much screwed.
>

Generally, at what point you pay the penalty is a trade-off. I would
argue that full_page_writes makes me pay for the full-page image on the
first transaction commit that changes a page, which I can never avoid,
and the result is a transaction response time that is unacceptable,
since a similar transaction that modifies an already-dirty page
deviates far less. With double writes, however, I can avoid page
evictions by selecting a bigger buffer pool (not that I necessarily
want to do that, but I have the choice without losing reliability).
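
The knob I mean is just the buffer pool size; something like the
following (values purely illustrative) keeps torn-page protection
today, and with double writes you could make the same choice with
full_page_writes off:

    # postgresql.conf -- illustrative values only
    shared_buffers = 8GB     # bigger buffer pool, so backends rarely
                             # have to evict and double-write pages
    full_page_writes = on    # today's torn-page protection; the double
                             # write scheme would let this be turned off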

Regards,
Jignesh

> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
