Re: genomic locus

From: Gene Selkov <selkovjr(at)gmail(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: genomic locus
Date: 2017-12-22 01:39:49
Message-ID: 91B02B27-D386-4471-B4A7-1244E6ABC6AE@gmail.com
Lists: pgsql-hackers


> On Dec 18, 2017, at 6:59 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
>
> If you think it'd make logical sense to extend seg with a string descriptor of some sort and could come up with a name/use case that's not quite so narrowly focused as genetics alone, then I could see adding it as a secondary type in the same extension.
>
> But it's more likely that the best course would be to extract the seg extension from core, rename it, hack it as desired, and build it as an extension maintained out-of-tree.

That is exactly what I’ve been doing for a few days now, and the process is testing my sanity.

Attaching a string to an interval seems like an easy enough undertaking, and I have got it to work at the UI level, at least. The queries I wanted to be able to make against it run without problems and produce the desired results. Here is my first attempt: https://github.com/selkovjr/locus

Problems arise around queries I didn’t expect to be making and there are issues around indexing that I am not sure how to solve.

The main problem is that attaching a tag to an interval makes it incommensurate with intervals having a different tag. That makes them hard to index with an access method based on containment, such as GiST.

Problem 1. What is a union of ‘1:6000-7000’ and ‘X:10000-20000’? Intuitively, it should be NULL; however, I am not sure the method allows for that, since it was developed for objects living in the same metric space. I have mechanically reproduced the indexing methods of seg, but the resulting index is broken. All queries against an indexed table return a null result.

Problem 2. While the intersection (overlap, &c.) of any two loci produces obvious results, non-intersection does not. When I query for all loci not overlapping ‘1:6000-7000’, I expect to find all non-overlapping loci on contig 1. I don’t want the query to return anything from other contigs, because it is obvious that features on different contigs do not overlap. I may be able to fix that by making separate functions for non-overlaps and adding a constraint to them, but that seems like a kludge.
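
To make the constraint in Problem 2 concrete, here is a minimal sketch in Python (not the extension's C code). The names Locus, parse_locus, overlaps and not_overlaps_same_contig are hypothetical, and inclusive interval bounds are an assumption. It only illustrates the intended semantics: non-overlap is reported solely for loci on the same contig.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    """Hypothetical model of a locus: a contig name plus an interval."""
    contig: str
    start: int
    end: int


def parse_locus(text: str) -> Locus:
    """Parse 'contig:start-end', e.g. '1:6000-7000'."""
    contig, _, span = text.partition(":")
    start, _, end = span.partition("-")
    return Locus(contig, int(start), int(end))


def overlaps(a: Locus, b: Locus) -> bool:
    """Loci overlap only on the same contig (inclusive bounds assumed)."""
    return a.contig == b.contig and a.start <= b.end and b.start <= a.end


def not_overlaps_same_contig(a: Locus, b: Locus) -> bool:
    """Constrained non-overlap: same contig and disjoint intervals.

    This is deliberately not the plain negation of overlaps(), which
    would also be true for every locus on every other contig.
    """
    return a.contig == b.contig and not overlaps(a, b)
```

With this definition, a locus on contig X is neither overlapping nor "non-overlapping" with respect to ‘1:6000-7000’, which is the behaviour described above.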

Problem 3 (alternative to 1). I realize that any clustering can help build an efficient index, no matter how bizarre. So I could, for example, ignore the contigs altogether and build a single index tree, using only position co-ordinates and pretending that all positions are on the same contig; the question then is whether and how such a lossy index will affect the ordering of query results. Can I use a separate function for ordering? I have yet to run that experiment. Note that this would be equivalent to indexing the attributes of a composite type separately (if I understood it correctly).
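
The lossy approach in Problem 3 can be sketched as a two-stage search, again in Python with hypothetical names (position_overlaps, exact_overlaps, search) and loci modelled as plain (contig, start, end) tuples with inclusive bounds assumed. The first stage stands in for an index that sees positions only; the second stage rechecks each candidate against the full locus, and ordering is applied afterwards by a separate sort. This is an illustration of the idea, not a claim about how GiST would execute it.

```python
def position_overlaps(a, b):
    """Lossy test: compares positions only and ignores the contig."""
    return a[1] <= b[2] and b[1] <= a[2]


def exact_overlaps(a, b):
    """Exact test: same contig and overlapping positions."""
    return a[0] == b[0] and position_overlaps(a, b)


def search(loci, query):
    """Return loci overlapping the query, ordered by (contig, start, end)."""
    # Stage 1: lossy candidate selection, as if all loci shared one contig.
    candidates = [locus for locus in loci if position_overlaps(locus, query)]
    # Stage 2: recheck every candidate with the exact predicate.
    hits = [locus for locus in candidates if exact_overlaps(locus, query)]
    # Ordering is independent of how the candidates were found.
    return sorted(hits)
```

In this model the lossy stage can only add false candidates, never drop true matches, so the result set and its ordering are decided entirely by the recheck and the sort.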

An alternative to neglecting the contig element might be to use it as a second dimension. Expressed that way, a union of several loci might consist of a set of contig names attached to the bounding interval. Not sure whether that makes any sense; to a first approximation, I imagine something equivalent to storing each contig’s data in a separate table with a separate index, except derived from a single actual database table, but I have no clue how to go about doing that.
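
A rough sketch of that "second dimension" idea, in Python with hypothetical names (Key, leaf_key, union, may_overlap) and inclusive bounds assumed: the union of two keys is the set of contig names seen so far plus the bounding interval, and a subtree is worth descending only if the query's contig is in the set and the positions overlap. Whether a key of this shape is practical inside a GiST operator class is an open question here.

```python
from typing import FrozenSet, NamedTuple


class Key(NamedTuple):
    """Hypothetical index key: a set of contigs plus a bounding interval."""
    contigs: FrozenSet[str]
    start: int
    end: int


def leaf_key(contig: str, start: int, end: int) -> Key:
    """Key for a single locus."""
    return Key(frozenset([contig]), start, end)


def union(a: Key, b: Key) -> Key:
    """Union is always defined: merge contig sets, take the bounding interval."""
    return Key(a.contigs | b.contigs, min(a.start, b.start), max(a.end, b.end))


def may_overlap(key: Key, query: Key) -> bool:
    """True if a subtree summarized by `key` could contain an overlapping locus."""
    shares_contig = bool(key.contigs & query.contigs)
    return shares_contig and key.start <= query.end and query.start <= key.end
```

Under these assumptions the union of ‘1:6000-7000’ and ‘X:10000-20000’ is the contig set {1, X} with the interval 6000-20000 rather than NULL, and a query on any other contig prunes that subtree immediately.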

Thanks,

—Gene
