Indexing unknown words with Tsearch2

From: Greg Maitrallain <greg(dot)maitrallain(at)evodia(dot)fr>
To: pgsql-general(at)postgresql(dot)org
Subject: Indexing unknown words with Tsearch2
Date: 2009-04-01 13:38:07
Message-ID: 49D36E3F.1080207@evodia.fr
Lists: pgsql-general

Hi,

First of all, excuse my poor English :)

I'm working on a full-text database with tsearch2 that contains French
historical writings.
I'm using the fr_ispell dictionary, which can be found here:
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
(ispell-french.tar.gz
<http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/dicts/ispell/ispell-french.tar.gz>,
submitted by Max Jacob)
The database encoding is LATIN1.

The problem is that the writings contain many names of public figures,
for example Churchill (the database covers WWII). But when I search
for these names, nothing is found.

I tried many things, including following this introduction:
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html
I think the root of the problem is that no lexeme is found (I could even
say an empty lexeme is found).

With the default en_stem dictionary, I get this:

SELECT lexize('en_stem', 'churchill');
"{churchil}"

Then I add the French dictionary:

INSERT INTO pg_ts_dict
  (SELECT 'fr_ispell',
          dict_init,
          'DictFile="/home/.../french.dict",'
          'AffFile="/home/.../french.aff",'
          'StopFile="/home/.../french.stop"',
          dict_lexize
   FROM pg_ts_dict
   WHERE dict_name = 'ispell_template');

And the result is :

SELECT lexize('fr_ispell', 'churchill');
""
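
The quoted "" above does not show whether the result is NULL or an empty array. As a sketch, assuming tsearch2's lexize follows the usual dictionary convention (NULL for a word the dictionary does not recognize, an empty array for a stop word), the two cases could be told apart like this:

```sql
-- Sketch only: assumes the fr_ispell row above has been inserted into pg_ts_dict.
-- TRUE would mean fr_ispell does not know the word at all (NULL result),
-- rather than treating it as a stop word (empty array '{}').
SELECT lexize('fr_ispell', 'churchill') IS NULL AS unknown_word;

-- Same check against the stemmer, for comparison.
SELECT lexize('en_stem', 'churchill') IS NULL AS unknown_word;
```

The column alias unknown_word is only illustrative; the dictionary names are the ones used in this message.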

My questions are:
- Is it expected that lexize gives an empty result for a word that is
neither in the dictionary nor in the stop words?
- Is there a way to get the word itself back as the result when it is
neither in the dictionary nor in the stop words?
- If yes, how?

I'd also welcome any other information you can give me...
Many thanks!

Greg Maitrallain.
