Re: GIN : Working with term positions

From: Yoann Moreau <yoann(dot)moreau(at)univ-avignon(dot)fr>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: GIN : Working with term positions
Date: 2011-10-26 13:38:19
Message-ID: 4EA80D4B.4050906@univ-avignon.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 21/10/11 12:23, Yoann Moreau wrote:
> Hello,
> I'm using a GIN index for a text column on a big table. I use it to
> rank the rows, but I also need to get the term positions for each
> document of a subset of documents. I assume these positions are stored
> in the index, because doc says positions can be used for cover density
> ranking and because to_tsvector function gives them :
>
> select * from to_tsvector('I get lexemes and I get term positions.');
> to_tsvector
> ----------------------------------------
> 'get':2,6 'lexem':3 'posit':8 'term':7
>
> I can get the term positions with to_tsvector but only by parsing the
> result string, is there any more handy way ? Something like :
> select * from term_and_positions('I get lexemes and I get term
> positions.');
> term | positions
> ---------+-----------
> 'get' | {2,6}
> 'lexem' | {3}
>
>
> Then, from the term positions, I need to get the character offset of
> these term positions. I assume it is NOT stored in the GIN index. By
> character offset I mean the character count from string begining to
> the term. For the previous example it would be : 'get' --> {2,20}.
>
> I thought about using ts_headline to return the whole text with term
> tagged and then parse it to compute the character offsets from the
> tags. But this function is very slow, seems like it does not use the
> GIN index at all. And I suppose it can't because there is no way to
> know from a term position where its substring is in the text.
>
> Now I think the only solution is to make my own C function parsing the
> text like to_tsvector does and counting terms AND characters read from
> the begining of the text to match them. I got a look on the code, and
> it does not seems easy to do because characters offset or string
> length are never used by the parsetext function (ts_parse.c). If you
> have any other suggestion, would love to hear that !
>
> Regards, Yoann Moreau

Hello again, I'm sorry my need is actually a bit different than what I
have asked. I need to get the term positions using the GIN index, when I
query my text column, i.e. for a given term.

For example for 2 rows of a 'docs' table with a text column 'text' :
'I get lexemes and I get term positions.'
'Did you get the positions ?'

I'd need a function like this :
select term_positions(text, 'get') from docs;
id_doc | positions
---------+-----------
1 | {2,6}
2 | {3}

I know it can't be as simple as this, because the query would first need
to be filtered with a WHERE using a tsquery and this can't be done in
the function called like in my example. I suppose such a feature does
not exist, but is there any way to get the positions of the matching
terms when querying a GIN index ?

The only possible way I imagine right now is to firstly filter the rows
with "to_tsvector(text) @@ to_tsquery('get')" and then call
"to_tsvector(text)" for the n highest ranked rows, parsing the string
returned by the function to find the term and its positions. But would
be way more efficient to get them directly at the first call when
matching the terms with @@ operator. I know it would be impossible if
the query contain more than 1 term because it can't return 2 arrays of
position in one row (i.e. for one document), but for now I'm trying to
do this for 1 query term.
Any help or advice would be welcome !

By the way, I have done the C function computing the character offset of
a given term position for a text column. It's not done in a good way,
but it's more a topic for hackers postgresql list.

Regards, Yoann Moreau

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message salah jubeh 2011-10-26 13:38:29 All and ANY
Previous Message Thomas Guettler 2011-10-26 10:35:12 Docs: Add Version Info: New since Version x.y ....