Re: BBU Cache vs. spindles

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, jd(at)commandprompt(dot)com, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, Steve Crawford <scrawford(at)pinpointresearch(dot)com>, pgsql-performance(at)postgresql(dot)org, Ben Chobot <bench(at)silentmedia(dot)com>
Subject: Re: BBU Cache vs. spindles
Date: 2010-10-22 03:47:24
Message-ID: 4CC1094C.1090306@2ndquadrant.com
Lists: pgsql-performance pgsql-www

Kevin Grittner wrote:
> I assume that we send a full
> 8K to the OS cache, and the file system writes disk sectors
> according to its own algorithm. With either platters or BBU cache,
> the data is persisted on fsync; why do you see a risk with one but
> not the other?

I'd like a 10 minute argument please. I started to write something to
refute this, but working through the sequence of events that leads to
the most questionable result left me a bit less certain than I was
before about the safety here. Here is the worst case I believe you're
describing:

1) Transaction is written to the WAL and sync'd; client receives
COMMIT. Since full_page_writes is off, the data in the WAL consists
only of the delta of what changed on the page.
2) 8K database page is written to OS cache
3) PG calls fsync to force the database block out
4) OS writes first 4K block of the change to the BBU write cache. Worst
case, this fills the cache, and it takes a moment for some random writes
to process before it has space to buffer again (makes this more likely
to happen, but it's not required to see the failure case here)
5) Sudden power interruption, second half of the page write is lost
6) Server restarts
7) That 4K write is now replayed from the battery-backed cache

At this point, you now have a torn 8K page, with 1/2 old and 1/2 new
data. Without a full page write in the WAL, is it always possible to
restore its original state now? In theory, I think it is. Since the
delta in the WAL should be overwriting all of the bytes that changed
between the old and new version of the page, applying it on top of any
of the four possible states here:

1) None of the data was written to the database page yet
2) The first 4K of data was written out
3) The second 4K of data was written out
4) All 8K was actually written out

should lead to the same result: an 8K page that includes the change that
was in the WAL but had not yet reached disk when the crash happened.
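
The reasoning above can be checked with a small toy model. This is only a
sketch under the assumption stated in the argument, namely that the WAL
delta overwrites every byte that differs between the old and new page. It
does not model how PostgreSQL actually encodes or replays WAL records, and
the helper names (make_delta, apply_delta, torn_states) are made up for
illustration.

```python
PAGE_SIZE = 8192          # one database page
HALF = PAGE_SIZE // 2     # one 4K filesystem block

def make_delta(old, new):
    """Return (offset, bytes) runs covering every byte that differs."""
    delta = []
    i = 0
    while i < len(old):
        if old[i] != new[i]:
            start = i
            while i < len(old) and old[i] != new[i]:
                i += 1
            delta.append((start, new[start:i]))
        else:
            i += 1
    return delta

def apply_delta(page, delta):
    """Overwrite the recorded byte ranges on top of whatever is on disk."""
    out = bytearray(page)
    for offset, data in delta:
        out[offset:offset + len(data)] = data
    return bytes(out)

def torn_states(old, new):
    """The four possible on-disk states listed above."""
    return {
        "nothing written": old,
        "first 4K written": new[:HALF] + old[HALF:],
        "second 4K written": old[:HALF] + new[HALF:],
        "all 8K written": new,
    }

# Hypothetical change that touches both 4K halves of the page.
old_page = bytes(PAGE_SIZE)
scratch = bytearray(old_page)
scratch[100:110] = b"A" * 10
scratch[5000:5020] = b"B" * 20
new_page = bytes(scratch)

delta = make_delta(old_page, new_page)
recovered = {
    name: apply_delta(state, delta)
    for name, state in torn_states(old_page, new_page).items()
}
all_match = all(page == new_page for page in recovered.values())
print(all_match)
```

In this model every starting state converges on the new page, because the
replay never reads anything from the torn page. If replay depended on
existing page contents, the toy model would no longer apply.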

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
