Re: Adding a suffix array index

From: Adam Witney <awitney(at)sghms(dot)ac(dot)uk>
To: Troels Arvin <troels(at)arvin(dot)dk>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Adding a suffix array index
Date: 2004-11-19 12:45:43
Message-ID: BDC39B77.3C2FC%awitney@sghms.ac.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hi Troels,

This is not related to the database aspects of your question... But there
are more than 4 possible letters in DNA sequences, 16 in fact. Depending on
the accuracy of the DNA sequences you are storing, you may come across
ambiguity DNA bases, so your type will have to take these into account. The
possible letters (according to IUPAC rules) are listed at the bottom of this
page

http://gnomix.ansci.umn.edu/FASTA.html

Cheers

Adam

> Hello,
>
> I'm working on a thesis project where I explore the addition of a
> specialized, bioinformatics-related data type to a RDBMS. My choice of
> RDBMS is PostgreSQL, of course, and I've started by adding a "dnaseq" (DNA
> sequence) data type, using PostgreSQL's APIs for type additions.
>
> The idea is to try to make it practical and even "attractive" to work with
> DNA sequences in an RDBMS. My starting goal is to make it viable to work
> with sequences in the 50-500 million base range. Some may think that
> RDBMSes and long chunks of data don't match well. My opinion is that the
> increasing power of computers and RDBMS software should at some point make
> it possible to work with DNA sequences in a "normal" data management
> setting like a RDBMS, instead of solely using stand-alone tools and
> stand-alone data files. Anyways, it's an open question if my hypothesis is
> right.
>
> The basic parts of the type are pretty much done. Those interested may
> have a look at http://troels.arvin.dk/svn-snap/postgresql-dnaseq/ (the
> code organization needs some clean-up). The basic type implementation
> should be improved by adding more string functions and by implementing a
> set of specialized selectivity functions. Part of my current code concerns
> packing DNA characters: As the alphabet of DNA strings is very small (four
> characters), it seems like a straigt-forward optimization to store each
> character in two bits. A warning: This is my first C project, so please
> don't laugh too much (publicly) if you find strange constructs in my code...
>
> Although B-trees and hash indexes may be used with my dnaseq type, they
> aren't very interesting for long DNA sequences. I would like to add
> support for adding a suffix array or suffix tree based index to dnaseq
> columns where the user expects long sequences.
>
> My review of the latest academic work on indexing of long sequences
> indicates that suffix arrays are probably the way to go: Recent advances
> in suffix array algorithms make them more attractive than suffix trees
> (because the take up less space). And since the DNA data I'm currently
> trying to support aren't separated in words, a string B-tree doesn't seem
> to be relevant.
>
> My first and most immediate goal is to support efficient answering of a
> question like "which rows contain the sequence TTGACCACTTG in column foo?".
>
> I have to immediate problems:
> 1. The algoriths I've found don't indicate a good way to
> update index arrays. So I'm thinking of implementing
> a sort of index-"cluster" which has one sub-index per
> indexed value. That way, I don't have to worry about
> updating a large index covering all the indexes values
> in the column. Is it conceivable to create such an
> index "cluster"?
> 2. Does someone know of interesting documentation (perhaps
> in the form of interesting code comments) which I should
> read, as a basis for creating a non-standard index type
> in PostgreSQL?

--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Troels Arvin 2004-11-19 13:13:48 Re: Adding a suffix array index
Previous Message Hannu Krosing 2004-11-19 12:38:20 Re: Adding a suffix array index