Re: GIN : Working with term positions

From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Yoann Moreau <yoann(dot)moreau(at)univ-avignon(dot)fr>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: GIN : Working with term positions
Date: 2011-10-27 10:23:49
Message-ID: CAF6yO=0-s_r_ZjQeP+Vwj8PiNXJ0X76FRj+ga_zdTSwjaTmSow@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

2011/10/26 Yoann Moreau <yoann(dot)moreau(at)univ-avignon(dot)fr>:
> On 21/10/11 12:23, Yoann Moreau wrote:
>>
>> Hello,
>> I'm using a GIN index for a text column on a big table. I use it to rank
>> the rows, but I also need to get the term positions for each document of a
>> subset of documents. I assume these positions are stored in the index,
>> because doc says positions can be used for cover density ranking and because
>> to_tsvector function gives them :
>>
>> select * from to_tsvector('I get lexemes and I get term positions.');
>>            to_tsvector
>> ----------------------------------------
>>  'get':2,6 'lexem':3 'posit':8 'term':7
>>
>> I can get the term positions with to_tsvector but only by parsing the
>> result string, is there any more handy way ? Something like :
>> select * from term_and_positions('I get lexemes and I get term
>> positions.');
>>  term    | positions
>> ---------+-----------
>>   'get' |     {2,6}
>>  'lexem' |       {3}
>>
>>
>> Then, from the term positions, I need to get the character offset of these
>> term positions. I assume it is NOT stored in the GIN index. By character
>> offset I mean the character count from string begining to the term. For the
>> previous example it would be : 'get' --> {2,20}.
>>
>> I thought about using ts_headline to return the whole text with term
>> tagged and then parse it to compute the character offsets from the tags. But
>> this function is very slow, seems like it does not use the GIN index at all.
>> And I suppose it can't because there is no way to know from a term position
>> where its substring is in the text.
>>
>> Now I think the only solution is to make my own C function parsing the
>> text like to_tsvector does and counting terms AND characters read from the
>> begining of the text to match them. I got a look on the code, and it does
>> not seems easy to do because characters offset or string length are never
>> used by the parsetext function (ts_parse.c). If you have any other
>> suggestion, would love to hear that !
>>
>> Regards, Yoann Moreau
>
> Hello again, I'm sorry my need is actually a bit different than what I have
> asked. I need to get the term positions using the GIN index, when I query my
> text column, i.e. for a given term.
>
> For example for 2 rows of a 'docs' table with a text column 'text' :
> 'I get lexemes and I get term positions.'
> 'Did you get the positions ?'
>
> I'd need a function like this :
> select term_positions(text, 'get') from docs;
>  id_doc | positions
> ---------+-----------
>       1 |     {2,6}
>       2 |       {3}
>
> I know it can't be as simple as this, because the query would first need to
> be filtered with a WHERE using a tsquery and this can't be done in the
> function called like in my example. I suppose such a feature does not exist,
> but is there any way to get the positions of the matching terms when
> querying a GIN index ?
>
> The only possible way I imagine right now is to firstly filter the rows with
> "to_tsvector(text) @@ to_tsquery('get')" and then call "to_tsvector(text)"
> for the n highest ranked rows, parsing the string returned by the function
> to find the term and its positions. But would be way more efficient to get
> them directly at the first call when matching the terms with @@ operator. I
> know it would be impossible if the query contain more than 1 term because it
> can't return 2 arrays of position in one row (i.e. for one document), but
> for now I'm trying to do this for 1 query term.
> Any help or advice would be welcome !
>
> By the way, I have done the C function computing the character offset of a
> given term position for a text column. It's not done in a good way, but it's
> more a topic for hackers postgresql list.
>

Don't forget when you success that word positions are affected by the
word removed by stop-words.

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation

In response to

Browse pgsql-general by date

  From Date Subject
Next Message mailtolouis2020-postgres@yahoo.com 2011-10-27 10:49:16 Re: pglesslog for Postgres 9.1.1
Previous Message Simon Riggs 2011-10-27 09:00:59 Re: Can someone help explain what's going on from the attached logs?