On 12/8/2011 7:05 AM, Michael Lush wrote:
I have a dataset with ~10000 columns and ~200000 rows (GWAS data (1)) of the form

sample1, A T, A A, G C, ....
sample2, A C, C T, A A, ....

I'd like to take subsets of both columns and rows for analysis

Two approaches spring to mind: either unpack it into something like an RDF triple,

i.e.
CREATE TABLE long_table (
    sample_id      varchar(20),
    column_number  int,
    snp_data       varchar(3));

for a table with 20 billion rows

or use the array datatype

CREATE TABLE wide_table (
    sample_id  varchar(20),
    snp_data   varchar(3)[]);
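
For what it's worth, the array approach would let you subset columns with PostgreSQL's array-slice syntax. A hypothetical sketch, assuming snp_data is declared as a varchar(3)[] array:

```sql
-- Hypothetical: pull genotype columns 100..200 for a subset of samples.
-- PostgreSQL array slicing uses [lower:upper] bounds.
SELECT sample_id, snp_data[100:200] AS snp_subset
FROM wide_table
WHERE sample_id IN ('sample1', 'sample2');
```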

Does anyone have any experience of this sort of thing?

(1) http://en.wikipedia.org/wiki/Genome-wide_association_study

--
Michael Lush

I store all my genotype data in a structure similar to this.

CREATE TABLE gen1000
(
-- Inherited from table genotypes:  snp_number integer NOT NULL,
-- Inherited from table genotypes:  sample_id integer NOT NULL,
-- Inherited from table genotypes:  genotype smallint NOT NULL,
  CONSTRAINT check1000 CHECK (snp_number < 58337 AND sample_id > 100000000 AND sample_id < 101000000)
)
INHERITS (genotypes)
WITH (
  OIDS=FALSE
);
ALTER TABLE gen1000
  OWNER TO postgres;
COMMENT ON TABLE gen1000
  IS '100 ANG';

-- Index: xgen1000_sample_id
CREATE INDEX xgen1000_sample_id
  ON gen1000
  USING btree
  (sample_id )
  WITH (FILLFACTOR=100)
TABLESPACE index_tablespace;
ALTER TABLE gen1000 CLUSTER ON xgen1000_sample_id;

-- Index: xgen1000_snp_number
CREATE INDEX xgen1000_snp_number
  ON gen1000
  USING btree
  (snp_number )
  WITH (FILLFACTOR=100)
TABLESPACE index_tablespace;


My implementation is basically a data warehouse where I am the only person with access to the db.

There are several benefits to storing your genotypes this way, and there are also benefits to storing them in more of a matrix format.  If you store them as your "long" table you really need to think about whether you can partition your data.

In my case I'm able to partition by sample_id, because all of my animals from a given breed (population, race, location, etc.) have IDs in a certain range.  Furthermore, I'm able to create partitions based on which assay (Illumina SNP50, Illumina HD, AFFX BOS-1) the genotypes came from, because my snp_numbers (rather than marker names) are integers specifically structured so that ranges of integers correspond to an assay.  Therefore, a given genotype table is only as long as the number of animals x the number of markers on the assay.  I have one partition that would be very large (20000 animals x 60000 markers = 1.2B rows), but I further split it up into sub-partitions of ~220M rows each.
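
To sketch why the check constraints matter: with constraint exclusion enabled, a range query against the parent table only scans the partitions whose CHECK ranges overlap the predicate. The specific ID values below are illustrative, not from my real data:

```sql
-- Assumes inheritance partitions like gen1000 above, each with a CHECK
-- constraint on its sample_id (and snp_number) range.
SET constraint_exclusion = partition;

-- The planner prunes every partition whose CHECK range cannot match,
-- so this query touches only the gen1000-style partition covering
-- sample_ids 100000000..101000000.
SELECT sample_id, snp_number, genotype
FROM genotypes
WHERE sample_id BETWEEN 100000123 AND 100000456
  AND snp_number < 58337;
```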

I have separate tables for marker information and sample information, keyed on snp_number and sample_id respectively.  Given this structure, appropriate indexes, and the check constraints, I'm able to grab specific "chunks" of data fairly easily.
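
A "chunk" query along those lines might look like the following; the metadata table and column names (markers, samples, chromosome, breed) are my guesses for illustration, not the actual schema:

```sql
-- Hypothetical metadata tables keyed on snp_number / sample_id.
-- Subsets both "columns" (markers) and "rows" (samples) via joins.
SELECT g.sample_id, g.snp_number, g.genotype
FROM genotypes g
JOIN markers m ON m.snp_number = g.snp_number
JOIN samples s ON s.sample_id = g.sample_id
WHERE m.chromosome = 14        -- marker subset (columns)
  AND s.breed = 'ANG';         -- sample subset (rows)
```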

As a side note, I take every opportunity I get to advise people not to store GWAS data in base format (AGCT).  If you are using Illumina data, at the very least store and work with it in A/B format.  Ditto for Affymetrix.  I actually convert the A/B format to my own coding and store each genotype as a single small integer (the genotype column above), where AA=1, BB=2, AB=3, missing=10.  At some point I'm going to change this to missing=0.

Here's why this is important.  With this coding you are able to store the genotype of an individual at one locus in 2 bytes.  You can also store phase information if you have it: B/A=4, A/null=5, B/null=6, null/A=7, null/B=8, null/null=9.

One other side benefit: if you want to check parent/child inheritance, all you need to do is find the loci where parent_genotype + child_genotype = 3 and count them.  Those are the only informative genotypes.  If you're working in base space it's more complicated.

Furthermore, you could take this data in columnar form, dump it, and rewrite it as a string (wide format) where each value is an individual genotype, which is about as compact as you can get it.  You just need to write your analysis software to decode it.
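
The parentage check described above reduces to a single aggregate under that integer coding (AA=1, BB=2, AB=3): the only pairs summing to 3 are the opposite homozygotes (1+2 and 2+1). A sketch with hypothetical sample IDs:

```sql
-- Loci where parent and child are opposite homozygotes, i.e. the
-- informative loci for a parent/child inheritance check.
SELECT count(*) AS informative_loci
FROM genotypes p
JOIN genotypes c ON c.snp_number = p.snp_number
WHERE p.sample_id = 100000001   -- hypothetical parent ID
  AND c.sample_id = 100000002   -- hypothetical child ID
  AND p.genotype + c.genotype = 3;
```

Note that this trick relies on missing being coded as 10; recoding missing as 0 would let AB (3) + missing (0) also sum to 3, so the predicate would need a guard against the missing code.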

Bob