Re: [to_tsvector] German Compound Words

From: "Sven R(dot) Kunze" <srkunze(at)tbz-pariv(dot)de>
To: obartunov(at)gmail(dot)com, Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: [to_tsvector] German Compound Words
Date: 2015-06-01 07:25:18
Message-ID: 556C08DE.9000102@tbz-pariv.de
Lists: pgsql-general

I actually wanted to minimize the installation effort. Thus, I used the
hunspell-de-de package of Debian/Ubuntu.

Give me a second for ispell.

Below, see the hunspell variant for
Produktionsintervall/Produktionintervall:

=# select * from ts_debug('public.german_compound', 'Produktionsintervall');
   alias   |   description   |        token         |         dictionaries          | dictionary  |        lexemes
-----------+-----------------+----------------------+-------------------------------+-------------+------------------------
 asciiword | Word, all ASCII | Produktionsintervall | {german_hunspell,german_stem} | german_stem | {produktionsintervall}
(1 row)

=# select * from ts_debug('public.german_compound', 'Produktionintervall');
   alias   |   description   |        token        |         dictionaries          | dictionary  |        lexemes
-----------+-----------------+---------------------+-------------------------------+-------------+-----------------------
 asciiword | Word, all ASCII | Produktionintervall | {german_hunspell,german_stem} | german_stem | {produktionintervall}
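
In both outputs the lexeme comes from german_stem, so german_hunspell apparently did not recognize either spelling. One way to confirm this is to query each dictionary on its own with ts_lexize. This is only a sketch: it assumes the german_hunspell dictionary from the quoted setup exists, and the component words in the second query are picked for illustration, not taken from the dictionary files.

```sql
-- Ask each dictionary directly what it makes of the token.
-- ts_lexize returns NULL when the dictionary does not recognize the word,
-- an empty array for a stop word, and an array of lexemes otherwise.
SELECT ts_lexize('german_hunspell', 'Produktionsintervall') AS hunspell_lexemes,
       ts_lexize('german_stem',     'Produktionsintervall') AS stem_lexemes;

-- Illustrative check of the assumed component words: if these are not
-- recognized either, the dictionary files themselves are the likely problem,
-- not the compound handling.
SELECT ts_lexize('german_hunspell', 'Produktion') AS produktion,
       ts_lexize('german_hunspell', 'Intervall')  AS intervall;
```

A NULL in the hunspell column next to a non-NULL stem column would match what ts_debug reports above, since a Snowball stemmer accepts any word it is given.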

PS: I am posting your answer to the list as well.

On 28.05.2015 19:42, Oleg Bartunov wrote:
> For readability it's better to use
>
> select * from ts_debug
>
> I remember there is a problem with correct support of hunspell files.
> Did you try ispell files?
> Also, I found this message: http://www.postgresql.org/message-id/dm1ece$2gb5$1@news.hub.org
>
> Try this word - Produktionintervall
>
>
> On Thu, May 28, 2015 at 6:34 PM, Sven R. Kunze <srkunze(at)tbz-pariv(dot)de> wrote:
>
> Sure. Here you are:
>
> =# select ts_debug('public.german_compound', 'wasserkraft');
> ts_debug
> -----------------------------------------------------------------------------------------------------
> (asciiword,"Word, all ASCII",wasserkraft,"{german_hunspell,german_stem}",german_stem,{wasserkraft})
>
> =# select ts_debug('public.german_compound', 'schifffahrt');
> ts_debug
> ---------------------------------------------------------------------------------------------------------
> (asciiword,"Word, all ASCII",schifffahrt,"{german_hunspell,german_stem}",german_hunspell,{schifffahrt})
>
> =# select ts_debug('public.german_compound', 'blindflansch');
> ts_debug
> -------------------------------------------------------------------------------------------------------
> (asciiword,"Word, all ASCII",blindflansch,"{german_hunspell,german_stem}",german_stem,{blindflansch})
>
> That is my testing configuration:
>
> =# \dF+ german_compound
> Text search configuration "public.german_compound"
> Parser: "pg_catalog.default"
> Token | Dictionaries
> -----------------+-----------------------------
> asciihword | german_hunspell,german_stem
> asciiword | german_hunspell,german_stem
> email | simple
> file | simple
> float | simple
> host | simple
> hword | german_hunspell,german_stem
> hword_asciipart | german_hunspell,german_stem
> hword_numpart | simple
> hword_part | german_hunspell,german_stem
> int | simple
> numhword | simple
> numword | simple
> sfloat | simple
> uint | simple
> url | simple
> url_path | simple
> version | simple
> word | german_hunspell,german_stem
>
>
> On 28.05.2015 17:24, Oleg Bartunov wrote:
>> ts_debug() ?
>>
>> =# select * from ts_debug('english', 'messages');
>>    alias   |   description   |  token   |  dictionaries  |  dictionary  | lexemes
>> -----------+-----------------+----------+----------------+--------------+----------
>>  asciiword | Word, all ASCII | messages | {english_stem} | english_stem | {messag}
>>
>>
>> On Thu, May 28, 2015 at 2:05 PM, Sven R. Kunze <srkunze(at)tbz-pariv(dot)de> wrote:
>>
>> Hi everybody,
>>
>> what do I need to do in order to enable compound word
>> handling in PostgreSQL's tsvector implementation?
>>
>> I run PostgreSQL 9.3 on an Ubuntu 14.04 machine, have installed
>> the package hunspell-de-de, and have already created a new
>> dictionary as described here:
>> http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY
>>
>> CREATE TEXT SEARCH DICTIONARY german_hunspell (
>> TEMPLATE = ispell,
>> DictFile = de_de,
>> AffFile = de_de,
>> StopWords = german
>> );
>>
>> Furthermore, I created a new test text search configuration
>> (copied from german) and updated every token type mapping that
>> used the german_stem dictionary so that it now uses
>> german_hunspell first and then german_stem.
>>
>> However, to_tsvector still does not split compound words
>> such as:
>>
>> wasserkraft -> wasserkraft, kraft
>> schifffahrt -> schifffahrt, fahrt
>> blindflansch -> blindflansch, flansch
>>
>> etc.
>>
>>
>> What have I done wrong here?
>>
>> --
>> Sven R. Kunze
>> TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
>> Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
>> e-mail: srkunze(at)tbz-pariv(dot)de
>> web: www.tbz-pariv.de
>>
>> Geschäftsführer: Dr. Reiner Wohlgemuth
>> Sitz der Gesellschaft: Chemnitz
>> Registergericht: Chemnitz HRB 8543
>>
>>
>>
>>
>>
>
>
>
>

--
Sven R. Kunze
TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
e-mail: srkunze(at)tbz-pariv(dot)de
web: www.tbz-pariv.de

Geschäftsführer: Dr. Reiner Wohlgemuth
Sitz der Gesellschaft: Chemnitz
Registergericht: Chemnitz HRB 8543
