Re: Scaling with memory & disk planning

From: Curt Sampson <cjs(at)cynic(dot)net>
To: terry(at)greatgulfhomes(dot)com
Cc: 'Jean-Luc Lachance' <jllachan(at)nsd(dot)ca>, <kgunders(at)cbnlottery(dot)com>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Scaling with memory & disk planning
Date: 2002-05-31 03:24:22
Message-ID: Pine.NEB.4.43.0205311146251.448-100000@angelic.cynic.net
Lists: pgsql-general

Jean-Luc Lachance writes:

> > I think your understanding of RAID 5 is wrong also.
> >
> > For a general N disk RAID 5 the process is:
> > 1)Read sector
> > 2)XOR with data to write
> > 3)Read parity sector
> > 4)XOR with result above
> > 5)write data
> > 6)write parity
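
For anyone who wants to see the six quoted steps concretely, here is a minimal sketch in Python. It is only an illustration: the "disks" are a hypothetical in-memory list holding one block each, and the names xor_blocks and rmw_write are made up for this example, not taken from any real RAID implementation.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def rmw_write(disks: list, data_idx: int, parity_idx: int, new_data: bytes) -> None:
    """Update one data block and the parity block, following the six steps."""
    old_data = disks[data_idx]                   # 1) read sector
    delta = xor_blocks(old_data, new_data)       # 2) XOR with data to write
    old_parity = disks[parity_idx]               # 3) read parity sector
    new_parity = xor_blocks(old_parity, delta)   # 4) XOR with result above
    disks[data_idx] = new_data                   # 5) write data
    disks[parity_idx] = new_parity               # 6) write parity


# Toy stripe (assumed layout): three data blocks followed by one parity block.
data = [b"\x01\x02", b"\x10\x20", b"\xaa\x55"]
disks = data + [reduce(xor_blocks, data)]

rmw_write(disks, data_idx=1, parity_idx=3, new_data=b"\xff\x00")

# The parity still equals the XOR of all data blocks after the update.
parity_ok = disks[3] == reduce(xor_blocks, disks[:3])
print(parity_ok)
```

Note the cost this makes visible: two reads and two writes for a single logical block write.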

Yes, generally. There are a couple of tricks you can do to help get
around this, though.

One, which works very nicely when doing sequential writes, is to attempt
to hold off on the write until you collect an entire stripe's worth
of data. Then you can calculate the parity based on what's in memory,
and write the new blocks across all of the disks without worrying
about what was on them before. 3ware's Escalade IDE RAID controllers
(the 3W-7x50 series, anyway) do this. Their explanation is at
http://www.3ware.com/NewFaq/general_operating_and_troubleshooting.htm#R5_Fusion_Explained .
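
As a rough sketch of the full-stripe idea (not 3ware's actual implementation, which I have not seen), the parity can be computed purely from the new blocks held in memory, with no reads from disk at all. The function names and the three-data-plus-one-parity layout below are assumptions for illustration.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def full_stripe_write(new_blocks: list) -> list:
    """Return the blocks to write: the new data plus parity computed in memory.

    Nothing is read back from the disks; the old contents are irrelevant
    because every block in the stripe is being replaced.
    """
    parity = reduce(xor_blocks, new_blocks)
    return list(new_blocks) + [parity]


# Hypothetical stripe of three data blocks collected in the write buffer.
stripe = full_stripe_write([b"\x01\x02", b"\x10\x20", b"\xaa\x55"])

# Any single lost block can be rebuilt by XORing the surviving blocks.
lost = 1
rebuilt = reduce(xor_blocks, [b for i, b in enumerate(stripe) if i != lost])
print(rebuilt == stripe[lost])
```

Compared with the read-modify-write path, this costs one write per disk and zero reads, which is why it pays off for sequential writes.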

Another tactic is just to buffer entire stripes. Baydel does this
with their disk arrays, which are actually RAID-3, not RAID-5.
(Since they do only full-stripe reads and writes, it doesn't really
make any difference which they use.) You want a fair amount of RAM
in your controller for buffering in this case, but it keeps the
computer's "read, modify, write" cycle on one block from turning
into "read, read, modify, write".

Terry Fielder writes:

> My simplification was intended, anyway it still equates to the same,
> because in a performance machine (lots of memory) reads are (mostly)
> pulled from cache (not disk IO). So the real cost is disk writes, and
> 2 = 2.

Well, it really depends on your workload. If you're selecting stuff
almost completely randomly scattered about a large table (like the
25 GB one I'm dealing with right now), it's going to be a bit pricy
to get hold of a machine with enough memory to cache that effectively.
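
A back-of-the-envelope sketch of why: assuming reads are uniformly random over the table (a simplification; real workloads usually have some locality), the fraction of reads served from memory is roughly the cache size divided by the table size. The cache sizes below are hypothetical.

```python
def expected_hit_ratio(cache_gb: float, table_gb: float) -> float:
    """Approximate cache hit ratio under uniformly random access.

    Assumes every block of the table is equally likely to be read, so the
    chance a block is already cached is the fraction of the table that fits.
    """
    return min(1.0, cache_gb / table_gb)


TABLE_GB = 25.0  # the table size mentioned above

for cache_gb in (1.0, 4.0, 16.0):  # hypothetical amounts of RAM for caching
    ratio = expected_hit_ratio(cache_gb, TABLE_GB)
    print(f"{cache_gb:5.1f} GB cache: about {ratio:.0%} of reads from memory")
```

Under that assumption most reads still go to disk until the cache approaches the size of the table itself.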

Kurt Gunderson writes:

] Likewise, when writing to the mirrored pair (and using 'write-through',
] never 'write-back'), the controller will pass along the 'data written'
] flag to the CPU when the first disk of the pair writes the data. The
] second will sync eventually but the controller need not wait for both.

I hope not! I think that controller ought to wait for both to be written,
because otherwise you can have this scenario:

1. Write of block X scheduled for drives A and B.
2. Block written to drive A. Still pending on drive B.
3. Controller returns "block committed to stable storage" to application.
4. Power failure. Pending write to drive B is never written.

Now, how do you know, when the system comes back up, that you have
a good copy of the block on drive A, but not on drive B?
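
To make the scenario concrete, here is a toy simulation. The two drives are plain dictionaries and mirrored_write is a made-up function, so this is a sketch of the failure mode under those assumptions, not a model of any real controller.

```python
def mirrored_write(drive_a: dict, drive_b: dict, block: str, data: bytes,
                   ack_after_first: bool, power_fails: bool) -> bool:
    """Write a block to both drives; return True if the write was acknowledged.

    power_fails simulates losing power after drive A has written the block
    but before drive B has.
    """
    drive_a[block] = data          # step 2: written to A, pending on B
    if ack_after_first and power_fails:
        return True                # steps 3 and 4: acked, then B never written
    if power_fails:
        return False               # no ack was sent before the failure
    drive_b[block] = data
    return True


# Early acknowledgement: the application is told the block is committed.
a1, b1 = {"X": b"old"}, {"X": b"old"}
acked_early = mirrored_write(a1, b1, "X", b"new",
                             ack_after_first=True, power_fails=True)

# Waiting for both: the same failure happens, but no ack was ever returned.
a2, b2 = {"X": b"old"}, {"X": b"old"}
acked_safe = mirrored_write(a2, b2, "X", b"new",
                            ack_after_first=False, power_fails=True)

print(acked_early, a1["X"], b1["X"])
print(acked_safe, a2["X"], b2["X"])
```

In both runs the two halves of the mirror disagree after the crash, but only in the first was the application told the data was safe. Nothing stored on the drives in this sketch says which copy is the newer one.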

cjs
--
Curt Sampson <cjs(at)cynic(dot)net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
