Re: How to switch off Snowball stemmer for tsearch2?

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Dmitry Koterov <dmitry(at)koterov(dot)ru>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to switch off Snowball stemmer for tsearch2?
Date: 2007-08-23 05:27:58
Message-ID: Pine.LNX.4.64.0708230925240.2727@sn.sai.msu.ru
Lists: pgsql-general

On Thu, 23 Aug 2007, Dmitry Koterov wrote:

> Oh! Thanks!
>
> delete from pg_ts_cfgmap where dict_name = ARRAY['ru_stem'];
>
> solves the root of the problem. But unfortunately
> russian.med (ru_ispell_cp1251) contains all Russian names, so "Ivanov"
> is converted to "Ivan" by ispell too. :-(
>
> Now
>
> select lexize('ru_ispell_cp1251', 'Дмитриев') -> "Дмитрий"
> select lexize('ru_ispell_cp1251', 'Иванов') -> "Иван"
> - it is completely wrong!
>
> I have a database with all Russian names; is it possible to use it (how?) to

If you have such a database, why not just write a special dictionary and
put it in front of the others?

> make lexize() not convert "Ivanov" to "Ivan" even if the ispell
> dictionary contains an element for "Ivan"? So, this pseudo-code logic is
> needed:
>
> function new_lexize($string) {
> $stem = lexize('ru_ispell_cp1251', $string);
> if ($stem in names_database) return $string; else return $stem;
> }
>
> Maybe tsearch2 implements this logic already?

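Sure, that is how the text search mapping works: the dictionaries mapped to
a token type are tried in order, and the first one that recognizes the word
wins. Dmitry, it seems your company could be my client :)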
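
A minimal sketch of that mapping logic, written in Python rather than in
tsearch2 itself. It only models the idea of a names dictionary placed in
front of ispell. The helper names (make_names_dict, make_toy_ispell,
lexize_chain) and the tiny word lists are hypothetical and are not part of
tsearch2; the toy ispell is a lookup table, not a real affix engine.

```python
from typing import Callable, Dict, Iterable, List, Optional, Sequence

# A dictionary maps a token to a list of lexemes, or returns None when it
# does not recognize the token (so the next dictionary gets a chance).
Dictionary = Callable[[str], Optional[List[str]]]


def make_names_dict(names: Iterable[str]) -> Dictionary:
    """Hypothetical names dictionary: returns a known name unchanged (lowercased)."""
    known = {name.lower() for name in names}

    def lexize(token: str) -> Optional[List[str]]:
        lowered = token.lower()
        return [lowered] if lowered in known else None

    return lexize


def make_toy_ispell(stems: Dict[str, str]) -> Dictionary:
    """Toy stand-in for an ispell dictionary: a fixed word -> stem table."""

    def lexize(token: str) -> Optional[List[str]]:
        stem = stems.get(token.lower())
        return [stem] if stem is not None else None

    return lexize


def lexize_chain(dictionaries: Sequence[Dictionary], token: str) -> List[str]:
    """Try each dictionary in order; the first one that recognizes the token wins."""
    for dictionary in dictionaries:
        result = dictionary(token)
        if result is not None:
            return result
    # No dictionary recognized the token; this sketch returns no lexemes.
    return []


# Assumed sample data, for illustration only.
names = make_names_dict(["Иванов", "Дмитриев"])
ispell = make_toy_ispell({"иванов": "иван", "дмитриев": "дмитрий", "иваны": "иван"})

print(lexize_chain([ispell], "Иванов"))         # ispell alone reduces the surname
print(lexize_chain([names, ispell], "Иванов"))  # names dictionary in front keeps it
print(lexize_chain([names, ispell], "Иваны"))   # ordinary words still reach ispell
```

With the names dictionary first, a surname never reaches ispell, so no
post-processing of the stem (as in the new_lexize pseudo-code) is needed.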

>
> On 8/22/07, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:
>>
>> On Wed, 22 Aug 2007, Dmitry Koterov wrote:
>>
>>> Suppose I cannot add such synonyms, because:
>>>
>>> 1. There are a lot of surnames; I cannot take care of all of them.
>>> 2. After adding a new surname I have to re-calculate all full-text
>>> indices, which costs too much (about 10 days to complete the
>>> recalculation).
>>>
>>> So, I need exactly what I asked for: to switch OFF stem guessing if a
>>> word is not in the dictionary.
>>
>> No problem, just modify pg_ts_cfgmap, which contains the mapping from
>> token types to dictionaries.
>>
>> If you change the configuration, you should rebuild the tsvector column
>> and reindex. 10 days looks very suspicious.
>>
>>
>>>
>>> On 8/22/07, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:
>>>>
>>>> On Wed, 22 Aug 2007, Dmitry Koterov wrote:
>>>>
>>>>> Hello.
>>>>>
>>>>> We use ispell dictionaries for tsearch2 (ru_ispell_cp1251).
>>>>> Now the Snowball stemmer is also configured.
>>>>>
>>>>> How do I properly switch OFF the Snowball stemmer for Russian without
>>>>> turning off the ispell stemmer? (It is really needed, because "Ivanov"
>>>>> is not the same as "Ivan".)
>>>>> Is it enough and correct to simply delete the row from pg_ts_dict or
>>>>> not?
>>>>>
>>>>> Here is the dump of pg_ts_dict table:
>>>>
>>>> Don't use a dump; a plain select would be better. In your case, I'd
>>>> suggest following the standard way: create a synonym file like
>>>> ivanov ivanov
>>>> and use it before the other dictionaries. The synonym dictionary will
>>>> recognize 'Ivanov' and return 'ivanov'.
>>>>
>>>>>
>>>>>
>>>>> dict_name:       en_ispell
>>>>> dict_init:       spell_init(internal)
>>>>> dict_initoption: DictFile=/usr/lib/ispell/english.med,AffFile=/usr/lib/ispell/english.aff,StopFile=/usr/share/pgsql/contrib/english.stop
>>>>> dict_lexize:     spell_lexize(internal,internal,integer)
>>>>>
>>>>> dict_name:       en_stem
>>>>> dict_init:       snb_en_init(internal)
>>>>> dict_initoption: contrib/english.stop
>>>>> dict_lexize:     snb_lexize(internal,internal,integer)
>>>>> dict_comment:    English Stemmer. Snowball.
>>>>>
>>>>> dict_name:       ispell_template
>>>>> dict_init:       spell_init(internal)
>>>>> dict_lexize:     spell_lexize(internal,internal,integer)
>>>>> dict_comment:    ISpell interface. Must have .dict and .aff files
>>>>>
>>>>> dict_name:       ru_ispell_cp1251
>>>>> dict_init:       spell_init(internal)
>>>>> dict_initoption: DictFile=/usr/lib/ispell/russian.med,AffFile=/usr/lib/ispell/russian.aff,StopFile=/usr/share/pgsql/contrib/russian.stop.cp1251
>>>>> dict_lexize:     spell_lexize(internal,internal,integer)
>>>>>
>>>>> dict_name:       ru_stem_cp1251
>>>>> dict_init:       snb_ru_init_cp1251(internal)
>>>>> dict_initoption: contrib/russian.stop.cp1251
>>>>> dict_lexize:     snb_lexize(internal,internal,integer)
>>>>> dict_comment:    Russian Stemmer. Snowball. WINDOWS (cp1251) Encoding
>>>>>
>>>>> dict_name:       ru_stem_koi8
>>>>> dict_init:       snb_ru_init_koi8(internal)
>>>>> dict_initoption: contrib/russian.stop
>>>>> dict_lexize:     snb_lexize(internal,internal,integer)
>>>>> dict_comment:    Russian Stemmer. Snowball. KOI8 Encoding
>>>>>
>>>>> dict_name:       ru_stem_utf8
>>>>> dict_init:       snb_ru_init_utf8(internal)
>>>>> dict_initoption: contrib/russian.stop.utf8
>>>>> dict_lexize:     snb_lexize(internal,internal,integer)
>>>>> dict_comment:    Russian Stemmer. Snowball. UTF8 Encoding
>>>>>
>>>>> dict_name:       simple
>>>>> dict_init:       dex_init(internal)
>>>>> dict_lexize:     dex_lexize(internal,internal,integer)
>>>>> dict_comment:    Simple example of dictionary.
>>>>>
>>>>> dict_name:       synonym
>>>>> dict_init:       syn_init(internal)
>>>>> dict_lexize:     syn_lexize(internal,internal,integer)
>>>>> dict_comment:    Example of synonym dictionary
>>>>>
>>>>> dict_name:       thesaurus_template
>>>>> dict_init:       thesaurus_init(internal)
>>>>> dict_lexize:     thesaurus_lexize(internal,internal,integer,internal)
>>>>> dict_comment:    Thesaurus template, must be pointed Dictionary and DictFile
>>>>>
>>>>
>>>> Regards,
>>>> Oleg
>>>> _____________________________________________________________
>>>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
>>>> Sternberg Astronomical Institute, Moscow University, Russia
>>>> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
>>>> phone: +007(495)939-16-83, +007(495)939-23-83
>>>>
>>>>
>>>
>>
>>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
