tsvector term positions and character offset

From: Yoann Moreau <yoann(dot)moreau(at)univ-avignon(dot)fr>
To: pgsql-hackers(at)postgresql(dot)org
Subject: tsvector term positions and character offset
Date: 2011-11-24 15:08:50
Message-ID: 4ECE5E02.2010207@univ-avignon.fr
Lists: pgsql-hackers
Hello, I'm working with text data, specifically tsvectors built from the 
text. A tsvector provides the terms and the positions of each term; I 
need to map these positions to the character offsets of the terms in the 
original text.

'This is an example text for example'
tsvector -> 'an':3 'exampl':4,7 'for':6 'is':2 'text':5 'this':1
What I need is, for the first term 'This', the offset 0, and for the term 
'example', the offsets 11 and 28.

I've searched for anything able to do this, without success (I also asked 
on the general PostgreSQL list).
As the character offsets do not seem to be stored or used at any point in 
the full-text functions, the only way I have found is to parse the text 
again, counting both the terms AND the characters read. I coded this as a 
very dirty external C function, with a lot of tsearch code copied because 
that code can't be used outside of its source file.
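
To make the re-parsing idea concrete, here is a minimal sketch in Python 
rather than C. It is only an illustration under simplifying assumptions: 
it splits on word characters with a regular expression, lowercases each 
word, and does no stemming or stop-word handling, so it does not 
reproduce the tsearch parser or dictionaries. The function name 
term_offsets is hypothetical.

```python
import re
from collections import defaultdict


def term_offsets(text):
    """Map each lowercased word to a list of (position, char_offset) pairs.

    Assumption: a word is any run of \\w characters. This is NOT the
    tsearch parser; terms are not stemmed ('example' stays 'example').
    """
    result = defaultdict(list)
    # Positions are 1-based, like tsvector positions; offsets are 0-based.
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        result[match.group().lower()].append((position, match.start()))
    return dict(result)


offsets = term_offsets("This is an example text for example")
print(offsets["this"])     # [(1, 0)]
print(offsets["example"])  # [(4, 11), (7, 28)]
```

The position in each pair lines up with the tsvector position for this 
sample sentence, which is what lets the two be joined. With a real text 
search configuration the tokens and positions may differ from this 
simple word split.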

My questions:
1) Is there any other way to achieve what I need?
2) Could this become part of some more general future functionality of 
the tsearch module?
If not, do you have any suggestions on how to code it as cleanly and 
robustly as possible?

Regards,
Yoann Moreau
