Re: tsearch2, ispell, utf-8 and german special characters

From: "Markus Wollny" <Markus(dot)Wollny(at)computec(dot)de>
To: "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>
Cc: <pgsql-general(at)postgresql(dot)org>, <openfts-general(at)lists(dot)sourceforge(dot)net>
Subject: Re: tsearch2, ispell, utf-8 and german special characters
Date: 2004-07-21 15:03:58
Message-ID: 2266D0630E43BB4290742247C891057505BF2F10@dozer.computec.de
Lists: pgsql-general

Hi!

I managed to resolve the issue with the unrecognized stop-word 'aber': the stopword file was UTF-8-encoded WITH a Byte Order Mark (BOM). The BOM is apparently not skipped when the file is read, so the first word of the stopword file, 'aber', was not recognized correctly. After removing the BOM, 'aber' was correctly filtered out as a stop-word.
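
For anyone who runs into the same thing, here is a minimal sketch of how the BOM could be detected and removed. It is not the exact procedure I used; the function name strip_utf8_bom is made up for illustration, and it assumes the file is small enough to read into memory.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path):
    """Remove a leading UTF-8 BOM from the file at `path`, in place.

    Returns True if a BOM was found and removed, False otherwise.
    (Hypothetical helper, not part of tsearch2.)
    """
    p = Path(path)
    data = p.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        # Rewrite the file without the three BOM bytes EF BB BF.
        p.write_bytes(data[len(codecs.BOM_UTF8):])
        return True
    return False
```

Calling strip_utf8_bom("german.stop") once should be enough; a second call returns False because nothing is left to strip.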

The issue with the stop-word 'ein', which to_tsvector converts to 'eint' instead of filtering it out, remains, however. Here is as much detail as I can provide:

We're using PostgreSQL 7.4.3, initdb'ed to a de_DE.utf8 locale; the database is in UNICODE encoding. I used the tsearch2-module provided in the /contrib-directory of the pg7.3.4-sources; I applied the patch from http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/regprocedure_7.4.patch.gz. OS is SuSE 7.3, LC_ALL and all other locale-variables are set to de_DE.utf8. Ispell is Version 3.1.20 10/10/95, patch 1.

Here's my tsearch2-config:
=========================================
select * from pg_ts_cfg:
ts_name;prs_name;locale
default;default;C
default_russian;default;ru_RU.KOI8-R
simple;default;
default_german;default;de_DE.utf8

select * from pg_ts_cfgmap where ts_name='default_german':
ts_name;tok_alias;dict_name
default_german;url;{simple}
default_german;host;{simple}
default_german;sfloat;{simple}
default_german;uri;{simple}
default_german;int;{simple}
default_german;float;{simple}
default_german;email;{simple}
default_german;word;{simple}
default_german;hword;{simple}
default_german;nlword;{simple}
default_german;nlpart_hword;{simple}
default_german;part_hword;{simple}
default_german;nlhword;{simple}
default_german;file;{simple}
default_german;uint;{simple}
default_german;version;{simple}
default_german;lhword;{de_ispell}
default_german;lpart_hword;{de_ispell}
default_german;lword;{de_ispell}

select * from pg_ts_dict:
dict_name;dict_init;dict_initoption;dict_lexize;dict_comment
simple;dex_init(text);;dex_lexize(internal,internal,integer);Simple example of dictionary.
en_stem;snb_en_init(text);/var/lib/pgsql/data/base/contrib/english.stop;snb_lexize(internal,internal,integer);English Stemmer. Snowball.
ru_stem;snb_ru_init(text);/var/lib/pgsql/data/base/contrib/russian.stop;snb_lexize(internal,internal,integer);Russian Stemmer. Snowball.
ispell_template;spell_init(text);;spell_lexize(internal,internal,integer);ISpell interface. Must have .dict and .aff files
synonym;syn_init(text);;syn_lexize(internal,internal,integer);Example of synonym dictionary
de_ispell;spell_init(text);DictFile="/usr/lib/ispell/german.med",AffFile="/usr/lib/ispell/german.aff",StopFile="/var/lib/pgsql/data/base/contrib/german.stop";spell_lexize(internal,internal,integer);

select * from pg_ts_parser:
prs_name;prs_start;prs_nexttoken;prs_end;prs_headline;prs_lextype;prs_comment
default;prsd_start(internal,integer);prsd_getlexeme(internal,internal,internal);prsd_end(internal);prsd_headline(internal,internal,internal);prsd_lextype(internal);Parser from OpenFTS v0.34
=========================================
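
Since the de_ispell entry above points at three files, a quick sanity check is to confirm that each of them exists on the database server. The following is only a sketch: the regular expression reflects my reading of the Name="value" format shown in dict_initoption, not tsearch2's own parser, and both function names are made up for illustration.

```python
import os
import re


def parse_init_option(option):
    """Split a string like 'DictFile="a",AffFile="b"' into a dict.

    Assumes every value is double-quoted, as in the config shown above.
    """
    return dict(re.findall(r'(\w+)\s*=\s*"([^"]*)"', option))


def missing_files(option):
    """Return the option names whose file paths do not exist on this machine."""
    return [name for name, path in parse_init_option(option).items()
            if not os.path.isfile(path)]
```

Passing the dict_initoption string of de_ispell to missing_files should return an empty list when all three files are in place.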
ISpell-Dictionary:
To generate the German ISpell dictionary, I ran:
wget http://j3e.de/ispell/igerman98/dict/igerman98-20030222.tar.bz2
bunzip2 igerman98-20030222.tar.bz2
tar -xvf igerman98-20030222.tar
cd igerman98-20030222
joe Makefile
[ there I set
LANG = de_DE.utf8
LC_ALL = de_DE.utf8
LC_COLLATE = de_DE.utf8
]
make
make install
sort -u -t/ +0f -1 +0 -T /usr/tmp -o german.med all.words
cp german.med /usr/lib/ispell/
=========================================
The stopwords-file is just a plain text-file in UTF-8 encoding with one word per line, like this:
aber
alle
allem
allen
aller
[...]
wollen
wollte
zu
zum
zur
zwar
zwischen

All in all that's 262 words, one on each line. Though the ß-characters (sharp s) in the file look broken when doing cat german.stop, everything looks fine in vim and I can enter the character correctly on the command line - I suspect there's something wrong with my SSH terminal (PuTTY) or some misconfiguration between bash and PuTTY.
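
To rule out the file itself rather than the terminal, one could check that it decodes as strict UTF-8 and list the words that contain non-ASCII characters. This is a rough sketch, assuming Python 3.7 or later; the function name non_ascii_words is made up for illustration.

```python
def non_ascii_words(path):
    """Return (line_number, word) pairs for stop words with non-ASCII characters.

    Raises UnicodeDecodeError if the file is not valid UTF-8, which would
    point at the file rather than the terminal.
    """
    found = []
    with open(path, encoding="utf-8", errors="strict") as fh:
        for lineno, line in enumerate(fh, start=1):
            word = line.strip()
            if word and not word.isascii():
                found.append((lineno, word))
    return found
```

If the call returns the expected words without raising, the bytes in the file are fine and the broken display most likely comes from the terminal setup. A leftover BOM would also show up here, attached to the word on line 1.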
=========================================

I hope I have provided all the information needed to help me decide whether or not to deploy tsearch2, or what to do in order to get consistent results. I'd be happy to contribute to the docs for implementing tsearch2 for a German Unicode database, once all issues are resolved.

Thank you very much for your help!

Kind regards

Markus

> -----Original Message-----
> From: Oleg Bartunov [mailto:oleg(at)sai(dot)msu(dot)su]
> Sent: Wednesday, 21 July 2004 15:34
> To: Markus Wollny
> Cc: pgsql-general(at)postgresql(dot)org;
> openfts-general(at)lists(dot)sourceforge(dot)net
> Subject: Re: [GENERAL] tsearch2, ispell, utf-8 and german
> special characters
>
> Marcus,
>
> it'd be easier for others if you show your tsearch2 configuration.
> btw, what version of pgsql and tsearch2 (any patches applied
> ?) Since I don't know german I could provide a little help,
> but I'd like to have some words from you when you get all
> things working right, so other people would appreciate your
> experience.
>
> I wouldn't use tsearch2 in production until you understand
> your problem and get tsearch2 works correctly.
>
>
> Oleg
>
> On Wed, 21 Jul 2004, Markus Wollny wrote:
>
> > Hi!
> >
> > Okay, I changed locale via initdb and I've got it working
> > to some extent now.
> >
> > Now I've got some problem with the ISpell-dictionary and
> > the stopwords-list. Both have been compiled with de_DE.utf8-locale.
> >
> > When I
> > SELECT to_tsvector('default_german',
> > 'Jeden Tag wirst Du ein bisschen älter,
> > aber Du lernst');
> >
> > I get
> > 'tag':2 'aber':8 'eint':5 'lernen':10 'älter':7 'bisschen':6
> >
> > I've got three questions regarding this result:
> > 1. both 'ein' and 'aber' are included in the
> > stopwords-file, but they show up in the result, whereas
> > 'jeden', 'wirst', 'du' are removed correctly - why is the
> > stopword-list ignored for the former two?
> > 2. why does 'ein' appear as 'eint'?
> > 3. is this result actually no cause of alarm, so can I
> > deploy tsearch2 to my production databases nevertheless?
> >
> > I'm using
> > http://j3e.de/ispell/igerman98/dict/igerman98-20030222.tar.bz2
> > (the latest version of Heinz Knutzen's dictionary) and I've
> > edited its Makefile to use de_DE.utf8 in the locale settings;
> > all.words was indeed the file used to generate the hash, so I
> > guess that I can now be more or less sure that I've actually
> > followed the instructions in the docs precisely. I dropped
> > any references to the german snowball stemmer dictionary
> > which I had configured as fallback, so currently there's only
> > this one dictionary configured for ts_name default_german and
> > tok_alias lhword, lpart_hword, lword (the remaining tok_alias
> > entries are set to use the simple dictionary).
> >
> > Kind regards
> >
> > Markus
