Re: Storing number '001' ?

From: "Josh Berkus" <josh(at)agliodbs(dot)com>
To: Charles Hauser <chauser(at)acpub(dot)duke(dot)edu>
Cc: pgsql-novice(at)postgresql(dot)org
Subject: Re: Storing number '001' ?
Date: 2001-12-12 16:45:42
Message-ID: web-526341@davinci.ethosmedia.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

Chuck,

(I'm quoting most of your text -- my comments are embedded in your table
design)

> After writing this am, I think I may have begun to understand how to
> use the SERIAL id;
>
> A 'MASTER TABLE' containing ONLY clone description/component parts
> (894001A01.x1)
>
> CREATE TABLE clone(
> clone_id SERIAL PRIMARY KEY,
> project INTEGER NOT NULL,
> plate INTEGER NOT NULL,
> p_row CHAR(1) NOT NULL,
> p_column INTEGER NOT NULL,
> read CHAR(1) NOT NULL,
> ver INTEGER NOT NULL,
> UNIQUE(project,plate,p_row,p_column,read,ver)
> );
Josh: As I said before, it is up to you whether you make clone_id the
primary key and the 6-column combo a unique index, or vice-versa. Both
values are what is known as Candidate Keys, and will function to select
a unique record. That being said, the approach you have taken above is
probably marginally better as you will be Joining the clone_id more
often than the 6 columns.
>
> DATA TABLES which refer back to MASTER
>
> CREATE TABLE clone_fasta(
> clone_id SERIAL PRIMARY KEY,
J: No, you want to give this table its own ID column:
clone_fasta_id SERIAL PRIMARY KEY,
> fk_clone DATATYPE?? NOT NULL REFERENCES clone ON UPDATE ?? ON DELETE
> ??
J: I suggest instead:
clone_id INTEGER NOT NULL REFERENCES clone(clone_id) ON DELECT
CASCADE;
(the last modifier means that if you delete the clone, the fasta record
is deleted too. On Update Cascade is only relevant if you are going to
be manually changing the clone_id value, which you won't)
> seq TEXT NOT NULL,
> length INTEGER NOT NULL
> );
>
> CREATE TABLE clone_qual(
> clone_id SERIAL PRIMARY KEY,
> fk_clone DATATYPE?? NOT NULL REFERENCES clone ON UPDATE ?? ON DELETE
> ??
J: Per my comments above:
clone_qual_id SERIAL PRIMARY KEY,
clone_id INTEGER NOT NULL REFERENCES clone(clone_id) ON DELETE
CASCADE,
> qual INTEGER[] NOT NULL,
> );

You should NOT use an array to store production data. Array columns
are not relational, cannot be properly indexed, and are difficult to
query. I use array columns only in temporary tables and buffer tables.
Can you describe the data you are storing in Qual so that I can
reccommend a suitable subtable structure?

>
> CREATE TABLE gb_accessions(
> clone_id SERIAL PRIMARY KEY,
> fk_clone DATATYPE?? NOT NULL REFERENCES clone ON UPDATE ?? ON DELETE
> ??
J: Same as above. I think you can figure this one out.
> accn_no varchar(12) NOT NULL,
> gi_no varchar(12) NOT NULL
> );
>
> Close ?

Close.

The reasons you want to change the Foriegn Key column to match the name
of the key column in the parent table are: 1) clarity, and 2) A useful
shortcut in PostgreSQL: "SELECT * FROM clone JOIN clone_fasta USING
(clone_id)" that will save you some typing.

More comments below:

> Two sets of numbers are used in tracking. clone_id identifies an
> individual bacterial colony/clone. contig_id identifies a sequence
> assembled from multiple clones.
> e.g.
> clone A: GATTCTCTCTCTCGACGAGC
> clone B: GACGAGCATTATCTACGCATACTACTCATA
> contig (A+B): GATTCTCTCTCTCGACGAGCATTATCTACGCATACTACTCATA
>
>
> 1. clone id: 894001A01.x1:
> 894: sequencing project (3 or 4 digit number)
> 001: plate clone resides in (96 well microtiter plates)
> A: plate row (A-H)
> 01: plate column (1-12)
> x: read, which primer was used to sequence the clone (x or y)
> 1: ver, some clones are sequenced more than once (1,2...n)
>
> 2. contig id: 20010822.123.1
> 20010822: date contig was assembled
> 123: contig_no
> 1: ver, several possibilities may result
>
> The assembled sequence of a CONTIG represents that of a gene.
>
> The TABLES I have at present are:
>
> clone_fasta:
> clone_id (894001A01.x1) in the 6 columns
> seq TEXT: DNA sequence
> length INTEGER: # of nucleotides/bases in the sequence
>
> clone_qual:
> clone_id (894001A01.x1) in the 6 columns
> qual INTEGER[]: array of quality values for each base, (1-2
> digit integer {1 9 9 45 38 44 ...}
> several columns of descriptive info
>
> gb_accessions:
> clone_id (894001A01.x1) in the 6 columns
> accn_no (BG842967): each clone sequence is submitted to
> GenBank, a repository of public sequences
> gi_no (14224151): GenBank assigns each clone an accession
> number (accn_no) and a gi_no.
>
> contig: contig_id (20010822.123.1) stored in 3 columns
> assembly_date: 20010822, date

Keep in mind that this can be stored as a real DATE, then formatted to
the above on query!

> contig_no: 123
> ver: 1
> seq: assembled sequence from several clones
> length: length of assembled sequence
> ests: array of clone_ids which were used to generate the
> contig {894001A01.x1, 963125H01.y2}

A lot of the numbering above could be automated through Postgres
functions.

>
> homolog: Contains putative annotation data, i.e. what gene is it
> assembly_date: 20010822, date
> contig_no: 123
> ver: 1
> several columns of descriptive info
> homolog: tentative description of 'gene' TEXT
>
> library: Info on the different DNA libraries being sequenced
> project: sequencing project designation, ie '894', same #
> seen in clones, 894001A01.x1.
> library: text describing library
> details: more detailed description
>
> TABLES clone_fasta, clone_qual and gb_accessions all contain info
> regarding individual bacterial clones derived from a 'library' of
> clones. They are interconnected by the clone id (894001A01.x1)
>
> TABLE library contains info explaining under what conditions clone
> 894001A01, and ALL clone prefixed by '894' were made. Also all other
> project descriptions (832,833,925,963, 1024,1031...) The
> descriptions detail growth conditions used to produce the DNA
> library.
>
> TABLES homolog and contig contain 'meta' info about the sequence and
> identity of DNA sequences obtained from assembling related clones. A
> contig '20010822.123.1' might be assembled from two clones
> (894001A01.x1, 963125H01.y2).

You'll want to also construct relational tables. For example, if a
contig is assembled from two clones, you'll want a table called
contig_clones that relates in this fashion:

clones contig_clones contig
clone_id -----------clone_id|contig_id----------contig_id
clone_id /--clone_id|contig_id
clone_id /
clone_id------/

> There, now you know all I do, and more so.
> There are more tables to be added in the future, but the data you saw
> on the web site resides in the above tables, or predicessors to them.

That's a great start on a spec document. In the future, if you have to
do this again, you'll want to do the above kind of assessment before any
actual table design.

To finish this off, I'd like some more info:
1. What is the information you want to store in arrays? You should be
sotring this information in tables for the reasons I outlined above.
2. For the 3 "clone" tables, is the relationship between them one-to-one
or one-to-many?

-Josh

______AGLIO DATABASE SOLUTIONS___________________________
Josh Berkus
Complete information technology josh(at)agliodbs(dot)com
and data management solutions (415) 565-7293
for law firms, small businesses fax 621-2533
and non-profit organizations. San Francisco

Browse pgsql-novice by date

  From Date Subject
Next Message Jason Earl 2001-12-12 17:30:33 Re: problem untarring the latest release
Previous Message S P Arif Sahari Wibowo 2001-12-12 16:24:24 Varying bit field: to set and query a particular bit