Re: How to switch off Snowball stemmer for tsearch2?

From: "Dmitry Koterov" <dmitry(at)koterov(dot)ru>
To: "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>
Cc: "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to switch off Snowball stemmer for tsearch2?
Date: 2007-08-22 22:32:34
Message-ID: d7df81620708221532x4a4d62f6k6c0f0923df413771@mail.gmail.com
Lists: pgsql-general

Oh! Thanks!

delete from pg_ts_cfgmap where dict_name = ARRAY['ru_stem'];

solves the root of the problem. But unfortunately russian.med
(ru_ispell_cp1251) contains all Russian names, so "Ivanov" is converted to
"Ivan" by ispell too. :-(

Now

select lexize('ru_ispell_cp1251', 'Дмитриев') -> "Дмитрий"
select lexize('ru_ispell_cp1251', 'Иванов') -> "Иван"
- it is completely wrong!

I have a database with all Russian names. Is it possible to use it (and how?)
to make lexize() not convert "Ivanov" to "Ivan" even if the ispell
dictionary contains an entry for "Ivan"? This pseudo-code logic is
needed:

function new_lexize($string) {
    // Stem with ispell first; keep the original word if the stem is a known name.
    $stem = lexize('ru_ispell_cp1251', $string);
    if ($stem in names_database) return $string;
    else return $stem;
}

Maybe tsearch2 implements this logic already?
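
For illustration only, the pseudo-code above can be sketched in runnable form. This is a minimal Python sketch, not tsearch2 code: ISPELL_STEMS, NAMES_DATABASE and the lexize() stand-in are hypothetical placeholders for the ru_ispell_cp1251 dictionary and the names database, and the stand-in simply returns the word itself when the dictionary does not know it.

```python
# Hypothetical stand-ins: a real setup would call
# lexize('ru_ispell_cp1251', ...) in PostgreSQL and query the names
# database instead of using these hard-coded literals.
ISPELL_STEMS = {"Иванов": "Иван", "Дмитриев": "Дмитрий", "столы": "стол"}
NAMES_DATABASE = {"Иван", "Дмитрий"}


def lexize(word):
    """Stand-in for the ispell dictionary (assumption: unknown words are returned unchanged)."""
    return ISPELL_STEMS.get(word, word)


def new_lexize(word):
    """Return the original word if its ispell stem is a known name, else the stem."""
    stem = lexize(word)
    return word if stem in NAMES_DATABASE else stem


print(new_lexize("Иванов"))  # stem "Иван" is a known name, so the surname is kept
print(new_lexize("столы"))   # stem "стол" is not a name, so the stem is returned
```

The check is done on the stem rather than on the input word, exactly as in the pseudo-code, so any word that ispell reduces to a first name is left unstemmed.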

On 8/22/07, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:
>
> On Wed, 22 Aug 2007, Dmitry Koterov wrote:
>
> > Suppose I cannot add such synonyms, because:
> >
> > 1. There are a lot of surnames, I cannot take care of all of them.
> > 2. After adding a new surname I have to re-calculate all full-text
> > indices, and it costs too much (about 10 days to complete the
> > recalculation).
> >
> > So, I need exactly what I asked for: switch OFF stem guessing if a word
> > is not in the dictionary.
>
> no problem, just modify pg_ts_cfgmap, which contains the mapping of
> tokens to dictionaries.
>
> if you change the configuration you should rebuild tsvector and reindex.
> 10 days looks very suspicious.
>
>
> >
> > On 8/22/07, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:
> >>
> >> On Wed, 22 Aug 2007, Dmitry Koterov wrote:
> >>
> >>> Hello.
> >>>
> >>> We use ispell dictionaries for tsearch2 (ru_ispell_cp1251).
> >>> Now Snowball stemmer is also configured.
> >>>
> >>> How to properly switch OFF Snowball stemmer for Russian without
> >>> turning off ispell stemmer? (It is really needed, because "Ivanov" is
> >>> not the same as "Ivan".)
> >>> Is it enough and correct to simply delete the row from pg_ts_dict or
> >>> not?
> >>>
> >>> Here is the dump of pg_ts_dict table:
> >>
> >> don't use dump, plain select would be better. In your case, I'd
> >> suggest following the standard way: create a synonym file like
> >> ivanov ivanov
> >> and use it before other dictionaries. Synonym dictionary will recognize
> >> 'Ivanov' and return 'ivanov'.
> >>
> >>> dict_name:       en_ispell
> >>> dict_init:       spell_init(internal)
> >>> dict_initoption: DictFile=/usr/lib/ispell/english.med,AffFile=/usr/lib/ispell/english.aff,StopFile=/usr/share/pgsql/contrib/english.stop
> >>> dict_lexize:     spell_lexize(internal,internal,integer)
> >>> dict_comment:
> >>>
> >>> dict_name:       en_stem
> >>> dict_init:       snb_en_init(internal)
> >>> dict_initoption: contrib/english.stop
> >>> dict_lexize:     snb_lexize(internal,internal,integer)
> >>> dict_comment:    English Stemmer. Snowball.
> >>>
> >>> dict_name:       ispell_template
> >>> dict_init:       spell_init(internal)
> >>> dict_initoption:
> >>> dict_lexize:     spell_lexize(internal,internal,integer)
> >>> dict_comment:    ISpell interface. Must have .dict and .aff files
> >>>
> >>> dict_name:       ru_ispell_cp1251
> >>> dict_init:       spell_init(internal)
> >>> dict_initoption: DictFile=/usr/lib/ispell/russian.med,AffFile=/usr/lib/ispell/russian.aff,StopFile=/usr/share/pgsql/contrib/russian.stop.cp1251
> >>> dict_lexize:     spell_lexize(internal,internal,integer)
> >>> dict_comment:
> >>>
> >>> dict_name:       ru_stem_cp1251
> >>> dict_init:       snb_ru_init_cp1251(internal)
> >>> dict_initoption: contrib/russian.stop.cp1251
> >>> dict_lexize:     snb_lexize(internal,internal,integer)
> >>> dict_comment:    Russian Stemmer. Snowball. WINDOWS (cp1251) Encoding
> >>>
> >>> dict_name:       ru_stem_koi8
> >>> dict_init:       snb_ru_init_koi8(internal)
> >>> dict_initoption: contrib/russian.stop
> >>> dict_lexize:     snb_lexize(internal,internal,integer)
> >>> dict_comment:    Russian Stemmer. Snowball. KOI8 Encoding
> >>>
> >>> dict_name:       ru_stem_utf8
> >>> dict_init:       snb_ru_init_utf8(internal)
> >>> dict_initoption: contrib/russian.stop.utf8
> >>> dict_lexize:     snb_lexize(internal,internal,integer)
> >>> dict_comment:    Russian Stemmer. Snowball. UTF8 Encoding
> >>>
> >>> dict_name:       simple
> >>> dict_init:       dex_init(internal)
> >>> dict_initoption:
> >>> dict_lexize:     dex_lexize(internal,internal,integer)
> >>> dict_comment:    Simple example of dictionary.
> >>>
> >>> dict_name:       synonym
> >>> dict_init:       syn_init(internal)
> >>> dict_initoption:
> >>> dict_lexize:     syn_lexize(internal,internal,integer)
> >>> dict_comment:    Example of synonym dictionary
> >>>
> >>> dict_name:       thesaurus_template
> >>> dict_init:       thesaurus_init(internal)
> >>> dict_initoption:
> >>> dict_lexize:     thesaurus_lexize(internal,internal,integer,internal)
> >>> dict_comment:    Thesaurus template, must be pointed Dictionary and DictFile
> >>
> >> Regards,
> >> Oleg
> >> _____________________________________________________________
> >> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> >> Sternberg Astronomical Institute, Moscow University, Russia
> >> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
> >> phone: +007(495)939-16-83, +007(495)939-23-83
> >>
> >
>
> Regards,
> Oleg
>
