Re: How to best use 32 15k.7 300GB drives?

From: Robert Schnabel <schnabelr(at)missouri(dot)edu>
To: "david(at)lang(dot)hm" <david(at)lang(dot)hm>
Cc: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: How to best use 32 15k.7 300GB drives?
Date: 2011-01-28 00:27:13
Message-ID: 4D420D61.3000109@missouri.edu
Lists: pgsql-performance


On 1/27/2011 5:19 PM, david(at)lang(dot)hm wrote:
> On Thu, 27 Jan 2011, Robert Schnabel wrote:
>
>> Hi,
>>
>> I use PostgreSQL basically as a data warehouse to store all the genetic data
>> that our lab generates. The only person that accesses the database is myself
>> and therefore I've had it housed on my workstation in my office up till now.
>> However, it's getting time to move it to bigger hardware. I currently have a
>> server that is basically only storing backup images of all our other
>> workstations so I'm going to move my database onto it. The server looks like
>> this: Windows Server Enterprise 2008 R2 64-bit, AMD 2350 quad-core x2, 32GB
>> RAM. For my purposes the CPUs and RAM are fine. I currently have an Adaptec
>> 52445+BBU controller that has the OS (4 drive RAID5), FTP (4 drive RAID5) and
>> two backup arrays (8 drive each RAID0). The backup arrays are in a 16 drive
>> external enclosure through an expander so I actually have 16 ports free on
>> the 52445 card. I plan to remove 3 of the drives from my backup arrays to
>> make room for 3 - 73GB 15k.5 drives (re-purposed from my workstation). Two
>> 16 drive enclosures with SAS2 expanders just arrived as well as 36 Seagate
>> 15k.7 300GB drives (ST3300657SS). I also intend to get an Adaptec 6445
>> controller with the flash module when it becomes available in about a month
>> or two. I already have several Adaptec cards so I'd prefer to stick with
>> them.
>>
>> Here's the way I was planning using the new hardware:
>> xlog & wal: 3 - 73G 15k.5 RAID1 + hot spare in enclosure A on 52445 controller
>> data: 22 - 300G 15k.7 RAID10 enclosure B&C on 6445 controller
>> indexes: 8 - 300G 15k.7 RAID10 enclosure C on 6445 controller
>> 2 - 300G 15k.7 as hot spares enclosure C
>> 4 spare 15k.7 for on the shelf
>>
>> With this configuration I figure I'll have ~3TB for my main data tables and
>> 1TB for indexes. Right now my database is 500GB total. The 3:1 split
>> reflects my current table structure and what I foresee coming down the road
>> in terms of new data.
>>
>> So my questions are 1) am I crazy for doing this, 2) would you change
>> anything and 3) is it acceptable to put the xlog & wal (and perhaps tmp
>> filespace) on a different controller than everything else? Please keep in
>> mind I'm a geneticist who happens to know a little bit about bioinformatics
>> and not the reverse. :-)
> a number of questions spring to mind
>
> how much of the time are you expecting to spend inserting data into this
> system vs querying data from the system?
>
> is data arriving continuously, or is it a matter of receiving a bunch of
> data, inserting it, then querying it?
>
> which do you need to optimize for, insert speed or query speed?
>
Bulk loads of gigabytes of data via COPY from csv files once every
couple of weeks. I basically only have a couple of different table
"types" based on the data going into them. Each type is set up with
inheritance, so a new child table is added for each "sample". Once the
bulk data is inserted I generally run some updates on columns to set
values that characterize the data, and those columns then get indexed.
Basically, once the initial manipulation is done the table is static,
and what I'm looking for is query speed.
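
One load cycle ends up looking roughly like the sketch below; the child
table name, file path, and the derived call_class column are made up
for illustration, not my actual schema:

-- assumes the child table already exists (created via INHERITS)
-- bulk load it from a csv export (path is just an example)
COPY genotypes_chip42 FROM 'C:/loads/chip42.csv' CSV;

-- add and populate a derived column that characterizes the data
-- (call_class and its encoding are hypothetical)
ALTER TABLE genotypes_chip42 ADD COLUMN call_class smallint;
UPDATE genotypes_chip42
   SET call_class = CASE WHEN genotype = 0 THEN 0 ELSE 1 END;

-- index the derived column only after the bulk work is done
CREATE INDEX genotypes_chip42_call_idx ON genotypes_chip42 (call_class);
ANALYZE genotypes_chip42;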

> do you expect your queries to be searching for a subset of the data
> scattered randomly throughout the input data, or do you expect it to be
> 'grab this (relatively) sequential chunk of input data and manipulate it to
> generate a report' type of thing?
Generally it's "grab a big sequential chunk of data and either dump it
to a csv or insert it into another table". I use external scripts to
format the data. My two big table structures look like this:

CREATE TABLE genotypes
(
    snp_number integer NOT NULL,
    sample_id integer NOT NULL,
    genotype smallint NOT NULL
);

There are ~58k unique snp_number. Other tables will have upwards of
600-700k snp_number. The child tables have a constraint based on
sample_id such as:
CONSTRAINT check100 CHECK (sample_id > 100000000 AND sample_id < 101000000)
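
So creating a new child for a sample set looks more or less like this
(the child table name below is just an example):

CREATE TABLE genotypes_100 (
    CONSTRAINT check100 CHECK (sample_id > 100000000 AND sample_id < 101000000)
) INHERITS (genotypes);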

The data is sorted by snp_number, sample_id. So if I want the data for
a given sample_id it would be a block of ~58k rows. The size of the
table depends on how many sample_id's there are. My largest has ~30k
sample_id by 58k snp_number per sample. The other big table (with
children) is "mutations" and is set up similarly so that I can access
individual tables (samples) based on constraints. Each of these
children has between 5 and 60 million records.
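
A typical "grab a chunk and dump it" step is then basically a COPY of
one sample's block; with constraint_exclusion on, the planner can skip
the children whose CHECK ranges can't match. The sample_id and output
path below are invented:

SET constraint_exclusion = on;

COPY (
    SELECT snp_number, sample_id, genotype
    FROM genotypes
    WHERE sample_id = 100000123
) TO 'C:/exports/sample_100000123.csv' CSV;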

> what is your connectivity to the raid enclosures? (does
> putting 22 drives on one cable mean that you will be limited due to the
> bandwidth of this cable rather than the performance of the drives)
>
> can you do other forms of raid on these drives or only raid 10?
This is all direct-attached storage via SAS2, so I'm guessing it's
probably limited by the single-port link between the controller and the
expander. Again, geneticist here, not computer scientist. ;-) The
enclosures have Areca ARC-8026-16 expanders. I can basically do
whatever RAID level I want.

> how critical is the data in this database? if it were to die would it just
> be a matter of recreating it and reloading the data? or would you lose
> irreplaceable data?
>
> David Lang
All of the data could be reloaded. Basically, once I get the data into
the database and I'm done manipulating it I create a backup copy/dump
which then gets stored at a couple different locations. Like I said, I
really only do big loads/updates periodically so if it tanked all I'd be
out is whatever I did since the last backup/dump and some time.

My goal is to 1) have a fairly robust system so that I don't have to
spend my time rebuilding things and 2) be able to query the data
quickly. Most of what I do is ad hoc queries. I get an idea ("how many
X have Y in this set of Z samples?") and write the query to get the
answer. I can wait a couple of minutes for an answer, but waiting an
hour is becoming tiresome.
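
For example (the snp number, genotype encoding, and sample range below
are purely made up), "how many samples in this batch have genotype 2 at
a given snp" comes out as something like:

SELECT count(*)
FROM genotypes
WHERE snp_number = 12345
  AND genotype = 2
  AND sample_id > 100000000 AND sample_id < 101000000;  -- one batch of samples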

Bob
