Re: genomic locus

From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Gene Selkov <selkovjr(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: genomic locus
Date: 2017-12-22 07:53:46
Message-ID: 02df3c8c-1e20-c3e3-1956-f998bcfef21b@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hmm, would you try to implement separate type for querying? Similar to tsquery,
lquery (for ltree), jsquery etc.

Gene Selkov wrote:
>
>> On Dec 17, 2017, at 7:57 PM, Robert Haas <robertmhaas(at)gmail(dot)com
>> <mailto:robertmhaas(at)gmail(dot)com>> wrote:
>>
>> On Fri, Dec 15, 2017 at 2:49 PM, Gene Selkov <selkovjr(at)gmail(dot)com
>> <mailto:selkovjr(at)gmail(dot)com>> wrote:
>>> I need a data type to represent genomic positions, which will consist of a
>>> string and a pair of integers with interval logic and access methods. Sort
>>> of like my seg type, but more straightforward.
>>
>> Have you thought about just using a composite type?
>
> Yes, I have. That is sort of what I have been doing; a composite type certainly
> gets the job done but I don’t feel it reduces query complexity, at least from
> the user’s point of view. Maybe I don’t know enough.
>
> Here’s an example of how I imagine a composite genomic locus (conventionally
> represented as text ‘:’ integer ‘-‘ integer):
>
> CREATE TYPE locus AS (contig text, coord int4range);
> CREATE TABLE test_locus (
>   pos locus,
>   ref text,
>   alt text,
>   id text
> );
> CREATE INDEX test_locus_coord_ix ON test_locus (((pos).coord));
> \copy test_locus from test_locus.tab
>
> Where test_locus.tab has stuff like:
>
> (chr3,"[178916937,178916940]")GAACHP2_PIK3CA_2
> (chr3,"[178916939,178916948]")AGAAAAGATCHP2_PIK3CA_2
> (chr3,"[178916940,178916941]")GACHP2_PIK3CA_2
> (chr3,"[178916943,178916944]")AGCHP2_PIK3CA_2
> (chr3,"[178916943,178916946]")AAGCHP2_PIK3CA_2
> (chr3,"[178916943,178916952]")AAGATCCTCCHP2_PIK3CA_2
> (chr3,"[178916944,178916945]")AGCHP2_PIK3CA_2
> (chr3,"[178916945,178916946]")GCCHP2_PIK3CA_2
> (chr3,"[178916945,178916946]")GTCHP2_PIK3CA_2
> (chr3,"[178916945,178916948]")GATCHP2_PIK3CA_2
>
> When the table is loaded, I can pull the subset shown above with this query:
>
> SELECT * FROM test_locus WHERE (pos).contig = 'chr3' AND (pos).coord &&
> '[178916937, 178916948]’;
>               pos               |    ref    | alt |      id
> --------------------------------+-----------+-----+---------------
>  (chr3,"[178916937,178916941)") | GAA       |     | CHP2_PIK3CA_2
>  (chr3,"[178916939,178916949)") | AGAAAAGAT |     | CHP2_PIK3CA_2
>  . . . .
>
> So far so good. It gets the job done. However, it is only a small step towards a
> fully encapsulated, monolithic type I want it to be. The above query It is
> marginally better than its atomic-type equivalent:
>
> SELECT * FROM test WHERE contig = 'chr3' AND greatest(start, 178916937) <=
> least(stop, 178916948);
>  contig |   start   |   stop    |    ref    | alt |      id
> --------+-----------+-----------+-----------+-----+---------------
>  chr3   | 178916937 | 178916940 | GAA       |     | CHP2_PIK3CA_2
>  chr3   | 178916939 | 178916948 | AGAAAAGAT |     | CHP2_PIK3CA_2
>  . . . .
>
> and it requires addition syntax transformations steps to go from conventional
> locus representation 'chr3:178916937-178916940' to composite
> '(chr3,"[178916937,178916940]”)’ and back.
>
> Of course, the relative benefits of partial encapsulation I achieve by bundling
> text with int4range accumulate, compared to (text, int4, int4), as queries grow
> more complex. But because the elements of a composite type still require a
> separate query term for each of them (unless there is some magic I am not aware
> of), the complexity of a typical query I need to run exceeds my feeble
> sight-reading capacity. I want things that are conceptually simple to be
> expressed in simple terms, if possible.
>
> Like so:
>
> CREATE EXTENSION locus;
> CREATE TABLE test_locus (
>   pos locus,
>   ref text,
>   alt text,
>   id text
> );
> \copy test_locus from data/oncomine.hotspot.tab
>
> SELECT * FROM test_locus WHERE pos && 'chr3:178916937-178916948';
>            pos            |    ref    | alt |      id
> --------------------------+-----------+-----+---------------
>  chr3:178916937-178916940 | GAA       |     | CHP2_PIK3CA_2
>  chr3:178916939-178916948 | AGAAAAGAT |     | CHP2_PIK3CA_2
>  chr3:178916940-178916941 | G         | A   | CHP2_PIK3CA_2
>  chr3:178916943-178916944 | A         | G   | CHP2_PIK3CA_2
>  chr3:178916943-178916946 | AAG       |     | CHP2_PIK3CA_2
>  chr3:178916943-178916952 | AAGATCCTC |     | CHP2_PIK3CA_2
>  chr3:178916944-178916945 | A         | G   | CHP2_PIK3CA_2
>  chr3:178916945-178916946 | G         | C   | CHP2_PIK3CA_2
>  chr3:178916945-178916946 | G         | T   | CHP2_PIK3CA_2
>  chr3:178916945-178916948 | GAT       |     | CHP2_PIK3CA_2
> (10 rows)
>
> I have encountered some pesky geometry / indexing problems while building this
> extension (https://github.com/selkovjr/locus), but I hope I can solve them at
> least at the level afforded by the composite type, while keeping the clean
> interface of a monolithic type. I understand I could probably achieve the same
> cleanliness by defining functions and operators over the complex type, but by
> the time I’m done with that, will I have coded about the same amount of stuff as
> required to build an extended type?
>
>
> Regards,
>
> —Gene
>
>
>
>

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2017-12-22 08:10:30 Re: [JDBC] [HACKERS] Channel binding support for SCRAM-SHA-256
Previous Message Teodor Sigaev 2017-12-22 07:49:12 Re: genomic locus