Re: fulltext search and hunspell

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Jens Sauer <jsauer65(at)googlemail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: fulltext search and hunspell
Date: 2011-02-08 10:34:32
Message-ID: Pine.LNX.4.64.1102081333380.31836@sn.sai.msu.ru
Lists: pgsql-general

Jens,

Have you tried the German compound dictionary from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ ?
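
For reference, a minimal sketch of how such a dictionary could be wired up and tested. It assumes the files are installed in tsearch_data as german.dict and german.affix; those file names, and the names german_compound and german_compound_cfg, are placeholders chosen for illustration, not names from the archive above.

```sql
-- Assumption: german.dict and german.affix are in $SHAREDIR/tsearch_data.
-- german_compound is a hypothetical dictionary name.
CREATE TEXT SEARCH DICTIONARY german_compound (
    TEMPLATE  = ispell,
    DictFile  = german,
    AffFile   = german,
    StopWords = german
);

-- Test the dictionary on its own, bypassing any configuration.
-- If compound splitting works, more than one lexeme should come back.
SELECT ts_lexize('german_compound', 'Schokoladenfabrik');

-- Hypothetical configuration that tries the compound dictionary first
-- and falls back to the Snowball stemmer.
CREATE TEXT SEARCH CONFIGURATION german_compound_cfg (COPY = pg_catalog.german);

ALTER TEXT SEARCH CONFIGURATION german_compound_cfg
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH german_compound, german_stem;

-- The "dictionary" column shows which dictionary actually recognized the token.
SELECT * FROM ts_debug('german_compound_cfg', 'Schokoladenfabrik');
```

If ts_debug still reports german_stem in the dictionary column, the compound dictionary did not recognize the word and the stemmer took over.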

Oleg

On Tue, 8 Feb 2011, Jens Sauer wrote:

> Hey,
>
> thanks for your answer.
>
> First I checked the links in the tsearch_data directory:
> de_de.affix and de_de.dict are symlinks to the corresponding files in
> /var/cache/postgresql/dicts/.
> Then I recreated them using pg_updatedicts.
>
> This is an extract of the de_de.affix file:
>
> # this is the affix file of the de_DE Hunspell dictionary
> # derived from the igerman98 dictionary
> #
> # Version: 20091006 (build 20100127)
> #
> # Copyright (C) 1998-2009 Bjoern Jacke <bjoern(at)j3e(dot)de>
> #
> # License: GPLv2, GPLv3 or OASIS distribution license agreement
> # There should be a copy of both of this licenses included
> # with every distribution of this dictionary. Modified
> # versions using the GPL may only include the GPL
>
> SET ISO8859-1
> TRY esijanrtolcdugmphbyfvkwqxz??????????ESIJANRTOLCDUGMPHBYFVKWQXZ????-.
>
> PFX U Y 1
> PFX U 0 un .
>
> PFX V Y 1
> PFX V 0 ver .
>
> SFX F Y 35
> [...]
>
> I cannot find "compoundwords controlled z" there, so I manually added it.
>
> [...]
> # versions using the GPL may only include the GPL
>
> compoundwords controlled z
>
> SET ISO8859-1
> TRY esijanrtolcdugmphbyfvkwqxz??????????ESIJANRTOLCDUGMPHBYFVKWQXZ????-.
> [...]
>
> Then I restarted PostgreSQL.
>
> Now I get an error:
> SELECT * FROM ts_debug('Schokoladenfabrik');
> FEHLER: falsches Affixdateiformat für Flag
> CONTEXT: Zeile 18 in Konfigurationsdatei
> „/usr/share/postgresql/8.4/tsearch_data/de_de.affix“: „PFX U Y 1
> “
> SQL-Funktion „ts_debug“ Anweisung 1
> SQL-Funktion „ts_debug“ Anweisung 1
>
> Which means:
> ERROR: wrong affix file format for flag
> CONTEXT: line 18 of configuration file ...
>
> If I add
> COMPOUNDFLAG Z
> ONLYINCOMPOUND L
>
> instead of "compoundwords controlled z"
>
> I don't get an error:
>
> SELECT * FROM ts_debug('Schokoladenfabrik');
>    alias   |   description   |       token       |         dictionaries          | dictionary  |      lexemes
> -----------+-----------------+-------------------+-------------------------------+-------------+-------------------
>  asciiword | Word, all ASCII | Schokoladenfabrik | {german_hunspell,german_stem} | german_stem | {schokoladenfabr}
> (1 row)
>
> But it seems that the hunspell dictionary is not working for compound words.
>
> Maybe pg_updatedicts has a bug and generates affix files in the wrong format?
>
> Jens
>
> 2011/2/7 Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>:
>> Jens,
>>
>> Could you check the affix file for
>> compoundwords  controlled z
>>
>> Also, can you provide a link to the dictionary files, so we can check whether
>> they are supported, since we have only rudimentary support for hunspell.
>> BTW, it'd be nice to have the output from ts_debug() to make sure the
>> dictionaries are actually used.
>>
>> Oleg
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
