Skip site navigation (1) Skip section navigation (2)

Re: Big wide datasets

From: "Robert D(dot) Schnabel" <schnabelr(at)missouri(dot)edu>
To: Michael Lush <mjlush(at)gmail(dot)com>
Cc: pgsql-novice(at)postgresql(dot)org
Subject: Re: Big wide datasets
Date: 2011-12-08 13:39:16
Message-ID: 4EE0BE04.5020707@missouri.edu (view raw or flat)
Thread:
Lists: pgsql-novice
<html>
  <head>
    <meta content="text/html; charset=ISO-8859-1"
      http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <br>
    <br>
    On 12/8/2011 7:05 AM, Michael Lush wrote:
    <blockquote
cite="mid:CACXX7MdoDdACfJMfhnugNoGxAhe-n5kxr716tGt6iUZ1n4ZKyQ(at)mail(dot)gmail(dot)com"
      type="cite">
      <meta http-equiv="Content-Type" content="text/html;
        charset=ISO-8859-1">
      I have dataset with ~10000 columns and ~200000 rows (GWAS data
      (1)) in the form<br>
      <br>
      sample1, A T, A A, G C, ....<br>
      sampel2, A C, C T, A A, ....<br>
      <br>
      I'd like to take subsets of both columns and rows for analysis<br>
      <br>
      Two approaches spring to mind either unpack it into something like
      an RDF triple<br>
      <br>
      ie <br>
      CREATE TABLE long_table (<br>
      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp; &nbsp;&nbsp; sample_id&nbsp; varchar(20),<br>
      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; column_number int,<br>
      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; snp_data&nbsp; varchar(3));<br>
      <br>
      for a table with 20 billion rows<br>
      <br>
      or use the array datatype<br>
      <br>
      CREATE TABLE wide_table (<br>
      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sample_id,<br>
      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; snp_data[]);<br>
      <br>
      Does anyone have any experience of this sort of thing?<br>
      <br>
      (1) <a moz-do-not-send="true"
        href="http://en.wikipedia.org/wiki/Genome-wide_association_study"
        target="_blank">http://en.wikipedia.org/wiki/Genome-wide_association_study</a><br>
      <br>
      --<br>
      Michael Lush<br>
    </blockquote>
    <br>
    I store all my genotype data similar to this.&nbsp; <br>
    <br>
    <tt>CREATE TABLE gen1000<br>
      (<br>
      -- Inherited from table genotypes:&nbsp; snp_number integer NOT NULL,<br>
      -- Inherited from table genotypes:&nbsp; sample_id integer NOT NULL,<br>
      -- Inherited from table genotypes:&nbsp; genotype smallint NOT NULL,<br>
      &nbsp; CONSTRAINT check1000 CHECK (snp_number &lt; 58337 AND sample_id
      &gt; 100000000 AND sample_id &lt; 101000000)<br>
      )<br>
      INHERITS (genotypes)<br>
      WITH (<br>
      &nbsp; OIDS=FALSE<br>
      );<br>
      ALTER TABLE gen1000<br>
      &nbsp; OWNER TO postgres;<br>
      COMMENT ON TABLE gen1000<br>
      &nbsp; IS '100 ANG';<br>
      <br>
      -- Index: xgen1000_sample_id<br>
      CREATE INDEX xgen1000_sample_id<br>
      &nbsp; ON gen1000<br>
      &nbsp; USING btree<br>
      &nbsp; (sample_id )<br>
      &nbsp; WITH (FILLFACTOR=100)<br>
      TABLESPACE index_tablespace;<br>
      ALTER TABLE gen1000 CLUSTER ON xgen1000_sample_id;<br>
      <br>
      -- Index: xgen1000_snp_number<br>
      CREATE INDEX xgen1000_snp_number<br>
      &nbsp; ON gen1000<br>
      &nbsp; USING btree<br>
      &nbsp; (snp_number )<br>
      &nbsp; WITH (FILLFACTOR=100)<br>
      TABLESPACE index_tablespace;</tt><br>
    <br>
    My implementation is basically a data warehouse where I am the only
    person with access to the db.<br>
    <br>
    There are several benefits to storing your genotypes this way and
    there are benefits to storing your genotypes in more of a matrix
    format.&nbsp; If you store them as your "long" table you really need to
    think about whether or not you can partition your data.&nbsp; In my case
    I'm able to partition by sample_id because all of my animals from a
    given breed (population, race, location etc) have IDs in a certain
    range.&nbsp; Furthermore, I'm able to create partitions based on which
    assay (Illumina SNP50, Illumina HD, AFFX BOS-1) the genotypes came
    from because my snp_numbers (rather than marker names) are integer
    and specifically structured so ranges of integers correspond to an
    assay.&nbsp; Therefore, a given genotype table is only as long as the
    number of animals X number of markers on the assay.&nbsp; I have one
    partition that would be very large (20000 animals X 60000 markers =
    1.2B) but what I did was further split it up into sub-partitions of
    ~220M.<br>
    <br>
    I have separate tables for marker information and sample information
    which are keyed on snp_number and sample_id.&nbsp; Given the structure
    and appropriate indexes, and check constraints I'm able to grab
    specific "chunks" of data fairly easily.<br>
    <br>
    As a side note, I take every opportunity I get to advise people not
    to store GWAS data in base format (AGCT).&nbsp; If you are using Illumina
    data at the very least store data and work with it in A/B format.&nbsp;
    Ditto for Affymetrix.&nbsp; I actually convert the A/B format to my own
    coding and store genotypes as a single small integer (genotype
    above) where AA=1, BB=2, AB=3, missing=10.&nbsp; At some point I'm going
    to change this to missing =0.&nbsp; Here's why this is important.&nbsp; With
    this coding you are able to store the genotype of an individual
    using 2 bytes.&nbsp; With this coding you can also store phase
    information if you have it... B/A=4, A/null=5, B/null =6, null/A=7,
    null/B=8, null/null=9.&nbsp; One other side benefit, if you are wanting
    to check parent/child inheritance all you need to do is find all the
    loci where parent_genotype + child_genotype = 3 and count the number
    of loci.&nbsp; Those are the only informative genotypes.&nbsp; If you're
    working in base space it's more complicated. &nbsp; Furthermore, you
    could take this data in columnar form and dump it and rewrite it as
    a string (wide) where each value is an individual genotype, which is
    about as compact as you can get it.&nbsp; You just need to write your
    analysis software to decode it.<br>
    <br>
    Bob<br>
    <br>
    <br>
  </body>
</html>


Attachment: unknown_filename
Description: text/html (6.4 KB)

In response to

pgsql-novice by date

Next:From: Gene PooleDate: 2011-12-08 15:04:37
Subject: Re: [GENERAL] Convert / Migrate From Oracle 11gR2 To PostgreSQL ? On CentOS 5.7 x86_64
Previous:From: Jean-Yves F. BarbierDate: 2011-12-08 13:24:44
Subject: Re: Big wide datasets

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group