Re: BBU Cache vs. spindles

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, jd(at)commandprompt(dot)com, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, Steve Crawford <scrawford(at)pinpointresearch(dot)com>, pgsql-performance(at)postgresql(dot)org, Ben Chobot <bench(at)silentmedia(dot)com>
Subject: Re: BBU Cache vs. spindles
Date: 2010-12-01 03:07:18
Message-ID: 201012010307.oB137IA19179@momjian.us
Lists: pgsql-performance pgsql-www

Greg Smith wrote:
> Kevin Grittner wrote:
> > I assume that we send a full
> > 8K to the OS cache, and the file system writes disk sectors
> > according to its own algorithm. With either platters or BBU cache,
> > the data is persisted on fsync; why do you see a risk with one but
> > not the other?
>
> I'd like a 10 minute argument please. I started to write something to
> refute this, only to clarify in my head the sequence of events that
> leads to the most questionable result, where I feel a bit less certain
> than I did before of the safety here. Here is the worst case I believe
> you're describing:
>
> 1) Transaction is written to the WAL and sync'd; client receives
> COMMIT. Since full_page_writes is off, the data in the WAL consists
> only of the delta of what changed on the page.
> 2) 8K database page is written to OS cache
> 3) PG calls fsync to force the database block out
> 4) OS writes first 4K block of the change to the BBU write cache. Worst
> case, this fills the cache, and it takes a moment for some random writes
> to process before it has space to buffer again (makes this more likely
> to happen, but it's not required to see the failure case here)
> 5) Sudden power interruption, second half of the page write is lost
> 6) Server restarts
> 7) That 4K write is now replayed from the battery's cache
>
> At this point, you now have a torn 8K page, with 1/2 old and 1/2 new
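
To make the sequence above concrete, here is a minimal Python sketch of
steps 4 through 7. It is only an illustration, not how PostgreSQL or any
controller actually performs I/O: the "disk" is a bytearray, and the
names write_page_with_crash and blocks_before_crash are hypothetical.
It assumes the 8K page reaches storage as two 4K blocks, in order, and
that power is lost after the first one.

```python
PAGE_SIZE = 8192   # PostgreSQL's default page size
BLOCK_SIZE = 4096  # OS block size assumed in the scenario above


def write_page_with_crash(disk: bytearray, offset: int, new_page: bytes,
                          blocks_before_crash: int) -> None:
    """Copy new_page to disk in BLOCK_SIZE chunks, stopping early.

    Only the first blocks_before_crash chunks are written, which stands
    in for a power loss partway through the page write.
    """
    assert len(new_page) == PAGE_SIZE
    for i in range(blocks_before_crash):
        start = i * BLOCK_SIZE
        end = start + BLOCK_SIZE
        disk[offset + start:offset + end] = new_page[start:end]


old_page = b"O" * PAGE_SIZE   # page contents before the transaction
new_page = b"N" * PAGE_SIZE   # page contents after the transaction

disk = bytearray(old_page)
write_page_with_crash(disk, 0, new_page, blocks_before_crash=1)

first_half = bytes(disk[:BLOCK_SIZE])
second_half = bytes(disk[BLOCK_SIZE:])

# Torn page: first half is new data, second half is still old data.
is_torn = (first_half == new_page[:BLOCK_SIZE]
           and second_half == old_page[BLOCK_SIZE:])
print("torn page:", is_torn)
```

With full_page_writes off, the WAL holds only the delta for this page,
so replay would be applied on top of that half-old, half-new image.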

Based on this report, I think we need to update our documentation and
backpatch removal of text that says that BBU users can safely turn off
full-page writes. Patch attached.

I think we have fallen into a trap I remember from the late 1990s, when
I assumed that an 8k-block based file system would write to the disk
atomically in 8k segments, which of course it cannot. My bet is that
even if you write to the kernel in 8k pages, and have an 8k file
system, the disk is still accessed via 512-byte blocks, even with a BBU.
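
As a rough back-of-the-envelope check on that bet, the arithmetic can be
sketched in Python. The 512-byte figure is the assumption stated above,
and torn_states is a hypothetical helper that further assumes the units
of one page land in order and a crash can fall between any two of them.

```python
PAGE_SIZE = 8192    # PostgreSQL's default page size
SECTOR_SIZE = 512   # sector size assumed in the paragraph above


def torn_states(page_size: int, unit_size: int) -> int:
    """Count the partially written states of one page.

    Assumes the page is written as page_size // unit_size units, in
    order, and that a crash can occur between any two of them.
    """
    return page_size // unit_size - 1


sectors_per_page = PAGE_SIZE // SECTOR_SIZE

print("sectors per 8k page:", sectors_per_page)
print("torn states with 512-byte sectors:", torn_states(PAGE_SIZE, SECTOR_SIZE))
print("torn states with 4k blocks:", torn_states(PAGE_SIZE, 4096))
print("torn states with an atomic 8k write:", torn_states(PAGE_SIZE, PAGE_SIZE))
```

Only the last case, a truly atomic 8k write, leaves no window for a torn
page under these assumptions.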

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

Attachment: /pgpatches/bbu (text/x-diff, 1.4 KB)
