tsearch2, ispell, utf-8 and german special characters

From: "Markus Wollny" <Markus(dot)Wollny(at)computec(dot)de>
To: <pgsql-general(at)postgresql(dot)org>, <openfts-general(at)lists(dot)sourceforge(dot)net>
Subject: tsearch2, ispell, utf-8 and german special characters
Date: 2004-07-20 16:49:39
Message-ID: 2266D0630E43BB4290742247C891057505BF2D18@dozer.computec.de
Lists: pgsql-general

Hi!

Sorry to bother you, but I just don't know how to get tsearch2 configured correctly for my setup. I have a 7.4.3 database cluster that was initdb'ed with de_DE@euro as its locale; the database itself uses Unicode encoding.

I built and installed contrib/tsearch2 after applying the dump/reload patch http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/regprocedure_7.4.patch.gz as advised by the docs. So far everything looks good: I generated a Snowball stemmer dictionary and an ISpell dictionary as described in the docs and created a new configuration 'default_german'.

This works, more or less:
SELECT to_tsvector('default_german',
'tsearch2 erlernen ist wie zur Schule zu gehen');
-> 'gehen':10 'schulen':8 'erlernen':3 'tsearch2':2

though I don't quite understand why "Schule" is converted to "schulen" rather than the other way round, but so be it. My problem lies, as so often, with the non-ASCII characters, namely the German umlauts and the ß.

SELECT to_tsvector('default_german',
'ich muß tsearch2 begreifen ');

returns NULL. So does any phrase that contains Ä, Ö, Ü, ä, ö, ü, ß or anything else beyond ASCII.
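One thing worth checking here is whether the locale's character set and the database encoding agree. The following is only a minimal illustrative sketch, not a diagnosis: it assumes that de_DE@euro implies the ISO-8859-15 character set, and it merely shows how the UTF-8 bytes of "muß" would look to byte-oriented code working under that character set. The helper name as_seen_by_latin9 is made up for this illustration and is not part of tsearch2.

```python
# Assumption: the de_DE@euro locale implies the ISO-8859-15 character set.
word = "muß"

# The same word stored in a Unicode (UTF-8) database versus a Latin-9 one.
utf8_bytes = word.encode("utf-8")          # ß takes two bytes here
latin9_bytes = word.encode("iso8859_15")   # ß takes one byte here


def as_seen_by_latin9(data: bytes) -> str:
    """Hypothetical helper: interpret raw bytes as ISO-8859-15 text."""
    return data.decode("iso8859_15")


# UTF-8 bytes read under Latin-9 rules no longer spell the original word.
misread = as_seen_by_latin9(utf8_bytes)

print(len(utf8_bytes), len(latin9_bytes))
print(misread == word)
```

If the two byte sequences differ like this, character classification driven by the locale could plausibly reject the multi-byte characters, but whether that is what happens inside tsearch2 is an assumption that would need confirming.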

Another issue is the ISpell functionality; the docs are quite vague when it comes to explaining which file(s) to use to create german.med. In ISpell conventions, umlauts seem to be represented as A" a" O" o" U" u", so when I run

SELECT lexize('de_ispell', 'Äther');
I receive NULL

whereas
SELECT lexize('de_ispell', 'A"ther');
gives me {"a\"ther"}
as the result.

I downloaded igerman98-20030222.tar.bz2 from http://j3e.de/ispell/igerman98/dict/ which seems to be the recommended ISpell dictionary distribution for German, as noted on http://fmg-www.cs.ucla.edu/fmg-members/geoff/ispell-dictionaries.html#German-dicts

Of course this distribution contains no german.0 or german.1 files, which would be the obvious counterparts to the english.0 and english.1 files mentioned in the tsearch2 docs; there is, however, a file all.words built on installation, which seems to be the basis for building the hash file later on. The first few lines of this file are:

A"bte/N
A"btissin/F
a"chten/DIXY
A"chtens
A"chtung/P
a"chzen/DIXY
a"chzt/EGPX
A"cker/N

To get the .med file I ran

sort -u -t/ +0f -1 +0 -T /usr/tmp -o german.med all.words

There is an option to generate another wordlist via make isowordlist, but this didn't resolve the umlaut issue, either in the standard encoding provided in the package or after conversion to UTF-8 (I tried both with and without a BOM).
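For reference, turning the A" style notation into real umlauts can be sketched as below. This is a minimal sketch under stated assumptions: it covers only the six sequences named above (A" a" O" o" U" u"), it does not handle ß, and the function names decode_umlauts and convert_wordlist are hypothetical, not part of ISpell or tsearch2. Affix flags after the slash are passed through untouched.

```python
# Only the six ASCII sequences mentioned in this message are mapped;
# how ß is written in the wordlist is not covered here.
UMLAUTS = {
    'A"': "Ä", 'a"': "ä",
    'O"': "Ö", 'o"': "ö",
    'U"': "Ü", 'u"': "ü",
}


def decode_umlauts(line: str) -> str:
    """Replace each ASCII umlaut sequence with the real character."""
    for ascii_form, char in UMLAUTS.items():
        line = line.replace(ascii_form, char)
    return line


def convert_wordlist(src_path: str, dst_path: str) -> None:
    """Hypothetical helper: rewrite an ASCII wordlist as UTF-8 without a BOM."""
    with open(src_path, encoding="ascii") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(decode_umlauts(line))


print(decode_umlauts('A"bte/N'))
```

A call such as convert_wordlist("all.words", "all.words.utf8") would produce the converted file; the output filename is just an example, and whether the dictionary loader then accepts the result is a separate question.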

Has anybody actually managed to get a working configuration with tsearch2 and German language support in a Unicode database? What am I doing wrong? I can't find any more hints in the docs. There is a thread on the OpenFTS mailing list with somewhat similar issues ( http://sourceforge.net/mailarchive/forum.php?thread_id=3979419&forum_id=7671 ), but nothing that would actually help to resolve it.

Kind regards

Markus
