Re: Big wide datasets

From: "Robert D(dot) Schnabel" <schnabelr(at)missouri(dot)edu>
To: Michael Lush <mjlush(at)gmail(dot)com>
Cc: pgsql-novice(at)postgresql(dot)org
Subject: Re: Big wide datasets
Date: 2011-12-08 13:39:16
Message-ID: 4EE0BE04.5020707@missouri.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<br>
<br>
On 12/8/2011 7:05 AM, Michael Lush wrote:
<blockquote
cite="mid:CACXX7MdoDdACfJMfhnugNoGxAhe-n5kxr716tGt6iUZ1n4ZKyQ(at)mail(dot)gmail(dot)com"
type="cite">
<meta http-equiv="Content-Type" content="text/html;
charset=ISO-8859-1">
I have dataset with ~10000 columns and ~200000 rows (GWAS data
(1)) in the form<br>
<br>
sample1, A T, A A, G C, ....<br>
sampel2, A C, C T, A A, ....<br>
<br>
I'd like to take subsets of both columns and rows for analysis<br>
<br>
Two approaches spring to mind either unpack it into something like
an RDF triple<br>
<br>
ie <br>
CREATE TABLE long_table (<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp; &nbsp;&nbsp; sample_id&nbsp; varchar(20),<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; column_number int,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; snp_data&nbsp; varchar(3));<br>
<br>
for a table with 20 billion rows<br>
<br>
or use the array datatype<br>
<br>
CREATE TABLE wide_table (<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sample_id,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; snp_data[]);<br>
<br>
Does anyone have any experience of this sort of thing?<br>
<br>
(1) <a moz-do-not-send="true"
href="http://en.wikipedia.org/wiki/Genome-wide_association_study"
target="_blank">http://en.wikipedia.org/wiki/Genome-wide_association_study</a><br>
<br>
--<br>
Michael Lush<br>
</blockquote>
<br>
I store all my genotype data similar to this.&nbsp; <br>
<br>
<tt>CREATE TABLE gen1000<br>
(<br>
-- Inherited from table genotypes:&nbsp; snp_number integer NOT NULL,<br>
-- Inherited from table genotypes:&nbsp; sample_id integer NOT NULL,<br>
-- Inherited from table genotypes:&nbsp; genotype smallint NOT NULL,<br>
&nbsp; CONSTRAINT check1000 CHECK (snp_number &lt; 58337 AND sample_id
&gt; 100000000 AND sample_id &lt; 101000000)<br>
)<br>
INHERITS (genotypes)<br>
WITH (<br>
&nbsp; OIDS=FALSE<br>
);<br>
ALTER TABLE gen1000<br>
&nbsp; OWNER TO postgres;<br>
COMMENT ON TABLE gen1000<br>
&nbsp; IS '100 ANG';<br>
<br>
-- Index: xgen1000_sample_id<br>
CREATE INDEX xgen1000_sample_id<br>
&nbsp; ON gen1000<br>
&nbsp; USING btree<br>
&nbsp; (sample_id )<br>
&nbsp; WITH (FILLFACTOR=100)<br>
TABLESPACE index_tablespace;<br>
ALTER TABLE gen1000 CLUSTER ON xgen1000_sample_id;<br>
<br>
-- Index: xgen1000_snp_number<br>
CREATE INDEX xgen1000_snp_number<br>
&nbsp; ON gen1000<br>
&nbsp; USING btree<br>
&nbsp; (snp_number )<br>
&nbsp; WITH (FILLFACTOR=100)<br>
TABLESPACE index_tablespace;</tt><br>
<br>
My implementation is basically a data warehouse where I am the only
person with access to the db.<br>
<br>
There are several benefits to storing your genotypes this way and
there are benefits to storing your genotypes in more of a matrix
format.&nbsp; If you store them as your "long" table you really need to
think about whether or not you can partition your data.&nbsp; In my case
I'm able to partition by sample_id because all of my animals from a
given breed (population, race, location etc) have IDs in a certain
range.&nbsp; Furthermore, I'm able to create partitions based on which
assay (Illumina SNP50, Illumina HD, AFFX BOS-1) the genotypes came
from because my snp_numbers (rather than marker names) are integer
and specifically structured so ranges of integers correspond to an
assay.&nbsp; Therefore, a given genotype table is only as long as the
number of animals X number of markers on the assay.&nbsp; I have one
partition that would be very large (20000 animals X 60000 markers =
1.2B) but what I did was further split it up into sub-partitions of
~220M.<br>
<br>
I have separate tables for marker information and sample information
which are keyed on snp_number and sample_id.&nbsp; Given the structure
and appropriate indexes, and check constraints I'm able to grab
specific "chunks" of data fairly easily.<br>
<br>
As a side note, I take every opportunity I get to advise people not
to store GWAS data in base format (AGCT).&nbsp; If you are using Illumina
data at the very least store data and work with it in A/B format.&nbsp;
Ditto for Affymetrix.&nbsp; I actually convert the A/B format to my own
coding and store genotypes as a single small integer (genotype
above) where AA=1, BB=2, AB=3, missing=10.&nbsp; At some point I'm going
to change this to missing =0.&nbsp; Here's why this is important.&nbsp; With
this coding you are able to store the genotype of an individual
using 2 bytes.&nbsp; With this coding you can also store phase
information if you have it... B/A=4, A/null=5, B/null =6, null/A=7,
null/B=8, null/null=9.&nbsp; One other side benefit, if you are wanting
to check parent/child inheritance all you need to do is find all the
loci where parent_genotype + child_genotype = 3 and count the number
of loci.&nbsp; Those are the only informative genotypes.&nbsp; If you're
working in base space it's more complicated. &nbsp; Furthermore, you
could take this data in columnar form and dump it and rewrite it as
a string (wide) where each value is an individual genotype, which is
about as compact as you can get it.&nbsp; You just need to write your
analysis software to decode it.<br>
<br>
Bob<br>
<br>
<br>
</body>
</html>

Attachment Content-Type Size
unknown_filename text/html 6.4 KB

In response to

Browse pgsql-novice by date

  From Date Subject
Next Message Gene Poole 2011-12-08 15:04:37 Re: [GENERAL] Convert / Migrate From Oracle 11gR2 To PostgreSQL ? On CentOS 5.7 x86_64
Previous Message Jean-Yves F. Barbier 2011-12-08 13:24:44 Re: Big wide datasets