Re: Hardware vs Software RAID

From: "Peter T(dot) Breuer" <ptb(at)inv(dot)it(dot)uc3m(dot)es>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Matthew Wakeling <matthew(at)flymine(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware vs Software RAID
Date: 2008-06-26 13:49:44
Message-ID: 200806261349.m5QDniN6026724@betty.it.uc3m.es
Lists: pgsql-performance

"Also sprach Merlin Moncure:"
> As discussed down thread, software raid still gets benefits of
> write-back caching on the raid controller...but there are a couple of

(I wish I knew what write-back caching was!)

Well, if you mean the Linux software raid driver, no, there's no extra
caching (buffering). Every request arriving at the device is duplicated
(for RAID1), using a local finite cache of buffer head structures and
real extra buffers from the kernel's general resources. Every arriving
request is dispatched to its subtargets as it arrives (as two or more
new requests). On reception of both (or more) acks, the original
request is acked, and not before.
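
As a rough user-space illustration of that dispatch pattern (a sketch only,
in Python, with hypothetical backing files standing in for the component
devices; it is not the md driver's code):

# Sketch of the RAID1 pattern described above: every incoming write is
# duplicated to each mirror leg, and the original request is acknowledged
# only once all legs have acknowledged.
import concurrent.futures

def write_leg(leg_path, offset, data):
    """Write one copy of the data to a single mirror leg."""
    with open(leg_path, "r+b") as f:
        f.seek(offset)
        f.write(data)
        f.flush()
    return True  # this leg's "ack"

def raid1_write(legs, offset, data):
    """Duplicate the request to all legs; ack only when every leg has acked."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(legs)) as pool:
        futures = [pool.submit(write_leg, leg, offset, data) for leg in legs]
        for fut in concurrent.futures.as_completed(futures):
            fut.result()  # wait for (or re-raise from) every component request
    return True  # the original request is acked here, and not before

if __name__ == "__main__":
    legs = ["leg0.img", "leg1.img"]      # hypothetical mirror legs
    for leg in legs:
        with open(leg, "wb") as f:
            f.write(b"\0" * 4096)        # create tiny stand-in "devices"
    raid1_write(legs, 1024, b"hello raid1")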

This imposes a considerable extra resource burden. It's a mystery to me
why the driver doesn't deadlock against other resource eaters that it
may depend on. Writing to a device that also needs extra memory per
request in its driver should deadlock it, in theory. Against a network
device as component, it's a problem (tcp needs buffers).

However the lack of extra buffering is really deliberate (double
buffering is a horrible thing in many ways, not least because of the
probable memory deadlock against some component driver's requirement).
The driver goes to the lengths of replacing the kernel's generic
make_request function just for itself in order to make sure full control
resides in the driver. This is required, among other things, to make
sure that request order is preserved.

It has the negative that standard kernel contiguous request merging does
not take place. But that's really required for sane coding in the
driver. Getting request pages into general kernel buffers ... may happen.
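
To make "contiguous request merging" concrete, here is a toy sketch (Python;
requests are just (sector, length) pairs, nothing from the real block layer).
Note that coalescing adjacent requests involves sorting and reshuffling the
queue, which is exactly the kind of reordering the raid driver avoids by
keeping dispatch to itself.

# Toy illustration of the contiguous-request merging the generic block layer
# normally performs, and which md forgoes by owning make_request: adjacent
# (sector, length) requests are coalesced into larger ones.
def merge_contiguous(requests):
    merged = []
    for sector, length in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == sector:
            prev_sector, prev_length = merged[-1]
            merged[-1] = (prev_sector, prev_length + length)  # extend previous
        else:
            merged.append((sector, length))
    return merged

print(merge_contiguous([(0, 8), (8, 8), (32, 8), (40, 8), (64, 8)]))
# -> [(0, 16), (32, 16), (64, 8)]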

> things I'd like to add. First, if your server is extremely busy, the
> write back cache will eventually get overrun and performance will
> eventually degrade to more typical ('write through') performance.

I'd like to know where this 'write back cache' is! (not to mention what
it is :). What on earth does `write back' mean? Perhaps you mean the
kernel's general memory system, which has the effect of buffering
and caching requests on the way to drivers like raid. Yes, if you write
to a device, any device, you will only write to the kernel somewhere,
which may or may not decide now or later to send the dirty buffers thus
created on to the driver in question, either one by one or merged. But
as I said, raid replaces most of the kernel's mechanisms in that area
(make_request, plug) to avoid losing ordering. I would be surprised if
the raw device exhibited any buffering at all after getting rid of the
generic kernel mechanisms. Any buffering you see would likely be
happening at file system level (and be a darn nuisance).
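
That generic buffering is easy to see from user space: a plain write() just
dirties page cache and returns, and nothing need reach the driver until the
kernel decides to write back, or until you force it with fsync(). A small
sketch (assumes a local writable filesystem; the file name is made up):

# write() returns once the data is in the page cache; fsync() forces the
# dirty pages down through the driver to the device.
import os, time

fd = os.open("writeback_demo.dat", os.O_WRONLY | os.O_CREAT, 0o644)

t0 = time.time()
os.write(fd, b"x" * (8 * 1024 * 1024))  # 8 MB lands in the page cache
t1 = time.time()
os.fsync(fd)                            # now it is actually pushed to the device
t2 = time.time()
os.close(fd)

print("write() returned after %.4fs (buffered in the page cache)" % (t1 - t0))
print("fsync() took a further %.4fs (real write-back to the device)" % (t2 - t1))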

Reads from the device are likely to hit the kernel's existing buffers
first, thus making them act as a "cache".
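
Likewise for reads: re-reading data that is already in the page cache never
touches the device at all. A quick sketch (again just a local file; a file
you have only just written is already cached, so drop the page cache between
runs if you want to see a genuinely cold first pass):

# Re-reading a cached file is served from the kernel's page cache, not the
# device; the second pass is normally much faster than a truly cold read.
import os, time

path = "readcache_demo.dat"
with open(path, "wb") as f:
    f.write(os.urandom(32 * 1024 * 1024))  # 32 MB test file

def timed_read():
    t0 = time.time()
    with open(path, "rb") as f:
        while f.read(1 << 20):
            pass
    return time.time() - t0

print("first read : %.4fs" % timed_read())
print("second read: %.4fs (page cache)" % timed_read())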

> Secondly, many hardware raid controllers have really nasty behavior in
> this scenario. Linux software raid has decent degradation in overload

I wouldn't have said so! If there is any, it's sort of accidental. On
memory starvation, the driver simply couldn't create and despatch
component requests. Dunno what happens then. It won't run out of buffer
head structs though, since it's pretty well serialised on those, per
device, in order to maintain request order, and it has its own cache.

> conditions but many popular raid controllers (dell perc/lsi logic sas
> for example) become unpredictable and very bursty in sustained high
> load conditions.

Well, that's because they can't tell the linux memory manager to quit
storing data from them in memory and let them have it NOW (a general
problem .. how one gets feedback on the mm state, I don't know). Maybe one
could .. one can control buffer aging pretty much per device nowadays.
Perhaps one can set the limit to zero for buffer age in memory before
being sent to the device. That would help. Also one can lower the
bdflush limit at which the device goes sync. All that would help against
bursty performance, but it would slow ordinary operation towards sync
behaviour.
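
On 2.6-era kernels the old bdflush knobs being alluded to correspond,
roughly, to the vm.dirty_* sysctls: dirty_expire_centisecs bounds how long
dirty buffers may age before write-back, and dirty_ratio is the point at
which writers are throttled into effectively synchronous behaviour. A sketch
that just reads the current settings (Linux only; writing new values needs
root):

# The 2.6 successors to the bdflush tunables live under /proc/sys/vm/.
# Lowering dirty_expire_centisecs and dirty_ratio pushes behaviour towards
# sync, which smooths out burstiness at the cost of ordinary throughput.
knobs = [
    "dirty_expire_centisecs",     # max age of dirty data before write-back
    "dirty_writeback_centisecs",  # how often the flusher wakes up
    "dirty_background_ratio",     # % of memory dirty before background flushing
    "dirty_ratio",                # % of memory dirty before writers are throttled
]

for knob in knobs:
    with open("/proc/sys/vm/" + knob) as f:
        print("vm.%s = %s" % (knob, f.read().strip()))

# To actually lower one (as root), write the new value back, e.g.:
#   open("/proc/sys/vm/dirty_expire_centisecs", "w").write("500\n")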

> As greg mentioned, I trust the linux kernel software raid much more
> than the black box hw controllers. Also, contrary to vast popular

Well, it's readable code. That's the basis for my comments!

> mythology, the 'overhead' of sw raid in most cases is zero except in
> very particular conditions.

It's certainly very small. It would be smaller still if we could avoid
needing new buffers per device. Perhaps the dm multipathing allows that.

Peter
