Re: [Fwd: PostgreSQL to index MEDLINE]

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Justin Clift <justin(at)postgresql(dot)org>
Cc: PostgreSQL General Mailing List <pgsql-general(at)postgresql(dot)org>, Olaf Bininda-Emonds <Olaf(dot)Bininda(at)tz(dot)agrar(dot)tu-muenchen(dot)de>
Subject: Re: [Fwd: PostgreSQL to index MEDLINE]
Date: 2003-05-27 06:39:15
Message-ID: Pine.GSO.4.55.0305271035460.9108@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Olaf,

what's the number of unique words (you want to be searched) in your collection ?
You may try contrib/tsearch V2 from http://www.sai.msu.su/~megera/postgres/gist/
May be I miss something but it looks like rather ordinary search system.

Oleg
On Tue, 27 May 2003, Justin Clift wrote:

> Hi everyone,
>
> Does anyone feel like assisting Olaf here?
>
> Regards and best wishes,
>
> Justin Clift
>
>
> -------- Original Message --------
> Subject: PostgreSQL to index MEDLINE
> Date: Wed, 21 May 2003 11:47:39 +0200
> From: Olaf Bininda-Emonds <Olaf(dot)Bininda(at)tz(dot)agrar(dot)tu-muenchen(dot)de>
> To: justin(at)postgresql(dot)org
>
> Hi,
>
> I have a question concerning the suitability of PostgreSQL to yield a
> word index of MEDLINE, an XML-formatted database of over 12 000 000
> articles from the biomedical literature. A computer scientist friend
> suggested that PostgreSQL might be the answer that we require.
>
> Briefly, our project is as follows:
>
> We would like to establish word co-occurences within each MEDLINE
> entry, with one word representing a gene and the other word
> representing one or more user-input keywords of functional traits. As
> the MEDLINE database is over 30 GB in size (and split into 396 separate
> files), we need an efficient, indexed way to search it.
>
> Our specific needs are as follows:
>
> 1) An index of all the words in the fields "title", "abstract" (not
> always present), and "medical subject headings" such that I can search
> the index to reveal which article (represented by the MedlineID field)
> contains a desired word. This will have two purposes:
>
> 1a) To build a subindex detailing which human genes are being referred
> to in an article. This is slightly complicated, in part because of the
> ever-changing nomenclature of genes. Officially, a gene is known by an
> alphanumeric symbol (of up to eight characters; e.g., BRCA1) and a name
> of one or more words ("breast cancer 1, early onset"). Searching for
> the symbol itself is insufficient, either because the symbol has not
> been used (e.g., older references) or can have other, common meanings
> (e.g., the symbol for "pyruvate carboxylase" is PC). Therefore,
> determining whether a gene is being mentioned in a given article
> requires a combination of the gene symbol being present and/or a
> sufficient number of the words in its name ("sufficient" being
> determined by the length of the gene symbol). I have a perl script that
> can determine this; however, it currently searches through all 396
> files on a pattern matching basis. Using an index would obviously be
> that more efficient, allowing me to query for a lists of articles that
> contain for a given gene, the gene symbol and each of the words in the
> gene name. The intersection of these lists would then be the articles
> held to contain a reference to that gene.
>
> 1b) To enable dynamic searches of the index for user-input keywords of
> functional traits. Searching for phrases ("lipid metabolism") would be
> ideal, but not necessary.
>
> In both cases, the matches would be for exact words (i.e., no stem
> searches).
>
> 2) The program should run in a UNIX environment (as implemented in Mac
> OS X) as I will be writing my own (hideously-inefficient) perl script
> to access it for queries. Generally, the protocol will be as follows:
>
> a) input gene-article subindex into memory
> b) obtain list of user-input keywords (and phrases, which will be split
> into individual words if phrases cannot be searched for)
> c) query index for MEDLINE articles containing each keyword in turn;
> articles will be scored according to the number of keywords they
> contain (i.e., an article with two keywords is scored twice as heavily
> as an article with only one keyword)
> d) for each article in the list in (c), determine whether each of ca.
> 16000 genes is also mentioned in that article (via subindex); scores
> for genes will be positively or negatively incremented according to
> whether or not they are present
> e) for each gene, determine its association score (= number of articles
> in which it co-occurs with at least one keyword - number of articles
> containing a keyword but not that gene).
>
> Therefore, the index should be written to disk as it will be accessed
> repeatedly for each keyword.
>
> ---
>
> I would appreciate your advice as to whether PostgreSQL is suitable for
> this task and, if so, how best to implement it and set up the word
> index. I have attached a sample of the MEDLINE database for you.
>
> With thanks in advance for your help!
>
> Best,
>
> Olaf
>
> ------------------------------------------------------------------------
> -----------------
> Olaf Bininda-Emonds
> Lehrstuhl fЭr Tierzucht
> Technical University of Munich
> Alte Akademie 12
> 85354 Freising-Weihenstephan
> Germany
>
> Phone: +49 (0)8161 713741
> Fax: +49 (0)8161 713107
> e-mail: Olaf(dot)Bininda(at)tierzucht(dot)tum(dot)de
> WWW: http://www.tierzucht.tum.de/Bininda-Emonds/
>
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Justin Clift 2003-05-27 06:44:03 Re: PostgreSQL 7.3.1 on WinXP
Previous Message Vincent Hikida 2003-05-27 05:52:50 Re: newbie sql question...