Re: [to_tsvector] German Compound Words

From: "Sven R(dot) Kunze" <srkunze(at)tbz-pariv(dot)de>
To: obartunov(at)gmail(dot)com, Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: [to_tsvector] German Compound Words
Date: 2015-06-01 08:13:05
Message-ID: 556C1411.4010608@tbz-pariv.de
Lists: pgsql-general

Alright, I got it running with the ispell files from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ ; specifically:
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/ispell/ispell-german-compound.tar.gz
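
For reference, a minimal sketch of how such a setup might look. The
dictionary and configuration names are the ones that appear in the ts_debug
output below; the file names german.dict and german.affix are an assumption
about how the files from the archive were installed, not something stated in
this thread.

```sql
-- Sketch only. Assumes the dictionary and affix files from the archive were
-- converted to UTF-8 and copied into $SHAREDIR/tsearch_data as
-- german.dict and german.affix (these file names are an assumption).
CREATE TEXT SEARCH DICTIONARY german_ispell (
    TEMPLATE  = ispell,
    DictFile  = german,   -- resolves to german.dict
    AffFile   = german,   -- resolves to german.affix
    StopWords = german    -- resolves to german.stop
);

-- Start from the built-in German configuration ...
CREATE TEXT SEARCH CONFIGURATION public.german_compound_ispell (
    COPY = pg_catalog.german
);

-- ... and consult the ispell dictionary before the Snowball stemmer
-- for the word-like token types.
ALTER TEXT SEARCH CONFIGURATION public.german_compound_ispell
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH german_ispell, german_stem;
```

The order in the WITH clause matters: a token is handed to german_stem only
if german_ispell does not recognize it.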

I am not sure where to find up-to-date, authoritative ispell dictionaries. I
figured that I need to change this particular dictionary in order to
avoid "ion" being split away from words like "produktION/konstruktION" etc.:

=# select * from ts_debug('public.german_compound_ispell', 'konstruktion');
   alias   |   description   |    token     |        dictionaries         |  dictionary   |           lexemes
-----------+-----------------+--------------+-----------------------------+---------------+------------------------------
 asciiword | Word, all ASCII | konstruktion | {german_ispell,german_stem} | german_ispell | {konstruktion,konstrukt,ion}

Unfortunately, compound words are not split consistently: wasserkraft
yields the whole word plus its parts, whereas konstruktionsplan yields
only the parts:

=# select * from ts_debug('public.german_compound_ispell', 'wasserkraft');
   alias   |   description   |    token    |        dictionaries         |  dictionary   |          lexemes
-----------+-----------------+-------------+-----------------------------+---------------+----------------------------
 asciiword | Word, all ASCII | wasserkraft | {german_ispell,german_stem} | german_ispell | {wasserkraft,wasser,kraft}

=# select * from ts_debug('public.german_compound_ispell', 'konstruktionsplan');
   alias   |   description   |       token       |        dictionaries         |  dictionary   |       lexemes
-----------+-----------------+-------------------+-----------------------------+---------------+---------------------
 asciiword | Word, all ASCII | konstruktionsplan | {german_ispell,german_stem} | german_ispell | {konstruktion,plan}

I am not sure where the 'sch' comes from:

=# select * from ts_debug('public.german_compound_ispell', 'rundflansch');
   alias   |   description   |    token    |        dictionaries         |  dictionary   |           lexemes
-----------+-----------------+-------------+-----------------------------+---------------+------------------------------
 asciiword | Word, all ASCII | rundflansch | {german_ispell,german_stem} | german_ispell | {rund,flansch,rund,flan,sch}

This is another funny example:

=# select * from ts_debug('public.german_compound_ispell', 'datenbanken');
   alias   |   description   |    token    |        dictionaries         |  dictionary   |                                     lexemes
-----------+-----------------+-------------+-----------------------------+---------------+---------------------------------------------------------------------------------
 asciiword | Word, all ASCII | datenbanken | {german_ispell,german_stem} | german_ispell | {datenbank,daten,date,banken,daten,date,bank,daten,date,banken,daten,date,bank}
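
The repeated lexemes above are what ts_debug reports for the dictionary. A
tsvector stores each distinct lexeme only once, so a quick way to compare the
raw dictionary output with what would actually be indexed might be the
following sketch (it assumes the dictionary and configuration names used
above; no output is shown because it depends on the installed dictionary
files):

```sql
-- Raw dictionary output for a single token, duplicates included
SELECT ts_lexize('german_ispell', 'datenbanken');

-- What ends up in the tsvector: each distinct lexeme appears once
SELECT to_tsvector('public.german_compound_ispell', 'datenbanken');
```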

On 01.06.2015 09:25, Sven R. Kunze wrote:
> I actually wanted to minimize the installation effort. Thus, I used
> the hunspell-de-de package of Debian/Ubuntu.
>
> Give me a second for ispell.
>
> Below, see the hunspell variant for
> Produktionsintervall/Produktionintervall:
>
> =# select * from ts_debug('public.german_compound', 'Produktionsintervall');
>    alias   |   description   |        token         |         dictionaries          | dictionary  |        lexemes
> -----------+-----------------+----------------------+-------------------------------+-------------+------------------------
>  asciiword | Word, all ASCII | Produktionsintervall | {german_hunspell,german_stem} | german_stem | {produktionsintervall}
> (1 row)
>
> =# select * from ts_debug('public.german_compound', 'Produktionintervall');
>    alias   |   description   |        token        |         dictionaries          | dictionary  |        lexemes
> -----------+-----------------+---------------------+-------------------------------+-------------+-----------------------
>  asciiword | Word, all ASCII | Produktionintervall | {german_hunspell,german_stem} | german_stem | {produktionintervall}
>
>
>
> PS: I am posting your answer to the list as well.
>
> On 28.05.2015 19:42, Oleg Bartunov wrote:
>> For readability it's better to use
>>
>> select * from ts_debug
>>
>> I remember there is a problem with correct support of hunspell files.
>> Did you try ispell files?
>> Also, I found this message: http://www.postgresql.org/message-id/dm1ece$2gb5$1@news.hub.org
>>
>> Try this word - Produktionintervall
>>
>>
>> On Thu, May 28, 2015 at 6:34 PM, Sven R. Kunze <srkunze(at)tbz-pariv(dot)de> wrote:
>>
>> Sure. Here you are:
>>
>> =# select ts_debug('public.german_compound', 'wasserkraft');
>> ts_debug
>> -----------------------------------------------------------------------------------------------------
>> (asciiword,"Word, all ASCII",wasserkraft,"{german_hunspell,german_stem}",german_stem,{wasserkraft})
>>
>> =# select ts_debug('public.german_compound', 'schifffahrt');
>> ts_debug
>> ---------------------------------------------------------------------------------------------------------
>> (asciiword,"Word, all ASCII",schifffahrt,"{german_hunspell,german_stem}",german_hunspell,{schifffahrt})
>>
>> =# select ts_debug('public.german_compound', 'blindflansch');
>> ts_debug
>> -------------------------------------------------------------------------------------------------------
>> (asciiword,"Word, all ASCII",blindflansch,"{german_hunspell,german_stem}",german_stem,{blindflansch})
>>
>> That is my testing configuration:
>>
>> =# \dF+ german_compound
>> Text search configuration "public.german_compound"
>> Parser: "pg_catalog.default"
>> Token | Dictionaries
>> -----------------+-----------------------------
>> asciihword | german_hunspell,german_stem
>> asciiword | german_hunspell,german_stem
>> email | simple
>> file | simple
>> float | simple
>> host | simple
>> hword | german_hunspell,german_stem
>> hword_asciipart | german_hunspell,german_stem
>> hword_numpart | simple
>> hword_part | german_hunspell,german_stem
>> int | simple
>> numhword | simple
>> numword | simple
>> sfloat | simple
>> uint | simple
>> url | simple
>> url_path | simple
>> version | simple
>> word | german_hunspell,german_stem
>>
>>
>> On 28.05.2015 17:24, Oleg Bartunov wrote:
>>> ts_debug() ?
>>>
>>> =# select * from ts_debug('english', 'messages');
>>>    alias   |   description   |  token   |  dictionaries  |  dictionary  | lexemes
>>> -----------+-----------------+----------+----------------+--------------+----------
>>>  asciiword | Word, all ASCII | messages | {english_stem} | english_stem | {messag}
>>>
>>>
>>> On Thu, May 28, 2015 at 2:05 PM, Sven R. Kunze <srkunze(at)tbz-pariv(dot)de> wrote:
>>>
>>> Hi everybody,
>>>
>>> what do I need to do in order to enable compound word
>>> handling in PostgreSQL tsvector implementation?
>>>
>>> I run an Ubuntu 14.04 machine, PostgreSQL 9.3, have
>>> installed package hunspell-de-de and already created a new
>>> dictionary as described here:
>>> http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY
>>>
>>> CREATE TEXT SEARCH DICTIONARY german_hunspell (
>>> TEMPLATE = ispell,
>>> DictFile = de_de,
>>> AffFile = de_de,
>>> StopWords = german
>>> );
>>>
>>> Furthermore, I created a new test text search configuration
>>> (copied from german) and updated all parser parts where the
>>> german_stem dictionary is used so that they use
>>> german_hunspell first and then german_stem.
>>>
>>> However, to_tsvector still does not split compound
>>> words such as:
>>>
>>> wasserkraft -> wasserkraft, kraft
>>> schifffahrt -> schifffahrt, fahrt
>>> blindflansch -> blindflansch, flansch
>>>
>>> etc.
>>>
>>>
>>> What have I done wrong here?
>>>
>>> --
>>> Sven R. Kunze
>>> TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
>>> Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
>>> e-mail: srkunze(at)tbz-pariv(dot)de
>>> web: www.tbz-pariv.de
>>>
>>> Geschäftsführer: Dr. Reiner Wohlgemuth
>>> Sitz der Gesellschaft: Chemnitz
>>> Registergericht: Chemnitz HRB 8543

--
Sven R. Kunze
TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
e-mail: srkunze(at)tbz-pariv(dot)de
web: www.tbz-pariv.de

Geschäftsführer: Dr. Reiner Wohlgemuth
Sitz der Gesellschaft: Chemnitz
Registergericht: Chemnitz HRB 8543
