Re: TSearch2: Problems with compound words and stop words

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Timo Haberkern <thaberkern(at)emedia-office(dot)de>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: TSearch2: Problems with compound words and stop words
Date: 2004-11-17 16:26:57
Message-ID: Pine.GSO.4.61.0411171923070.18871@ra.sai.msu.su
Lists: pgsql-general

Timo,

Take a look at the .aff file and search for 'compoundwords'.
The German ispell file I got from http://j3e.de/ispell/igerman98/ has no
support for compound words: 'compoundwords off'

Norwegian, for example, has:

compoundwords controlled z

compoundmin 4
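
To check a given .aff file for this directive, here is a minimal sketch in Python. The function name compound_setting and the sample strings are illustrative assumptions, not part of ispell or tsearch2. It also assumes the directive starts a line, as in the examples above, and that '#' begins a comment.

```python
from typing import Optional, Tuple


def compound_setting(aff_text: str) -> Tuple[str, Optional[str]]:
    """Return (mode, flag) from the first 'compoundwords' line.

    mode is the word after 'compoundwords' (e.g. 'off', 'on', 'controlled'),
    or 'missing' if no usable directive is found. flag is the character
    that follows 'controlled', otherwise None.
    """
    for raw in aff_text.splitlines():
        # Assumption: anything after '#' on the line is a comment.
        parts = raw.split("#", 1)[0].split()
        if parts and parts[0].lower() == "compoundwords":
            if len(parts) < 2:
                return "missing", None
            mode = parts[1].lower()
            flag = parts[2] if mode == "controlled" and len(parts) > 2 else None
            return mode, flag
    return "missing", None


# Sample inputs modelled on the two cases mentioned in this message.
samples = {
    "german": "compoundwords off\n",
    "norwegian": "compoundwords controlled z\n\ncompoundmin 4\n",
}
for name, text in samples.items():
    print(name, compound_setting(text))
```

For a real file, read its contents first (for example with open(path, encoding="latin-1").read(), where path is your own .aff file) and pass the text to compound_setting. A result of 'off' or 'missing' suggests the dictionary will not split compound words.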

Oleg

On Wed, 17 Nov 2004, Oleg Bartunov wrote:

> On Wed, 17 Nov 2004, Timo Haberkern wrote:
>
>> sorry for the late answer, I was on holiday,
>>
>> see my remarks below
>>
>>
>> Oleg Bartunov wrote:
>>
>>> On Fri, 5 Nov 2004, Timo Haberkern wrote:
>>>
>>>> Oleg,
>>>>
>>>> i use TSearch2 with PostgreSQL 7.4.6 and i applied the compoundword patch
>>>> yesterday. The configuration changed a little bit but the result is the
>>>> same. I get no compound words. I'm using the locale de_DE with encoding
>>>> ISO8859-1 for the database.
>>>>
>>>> I think ispell is working correctly except for the compound words. If I try
>>>>
>>>> SELECT lexize('de_ispell', 'springt')
>>>>
>>>> i get
>>>>
>>>> lexize
>>>> {springen,springen}
>>>>
>>>> which seems correct.
>>>>
>>>>
>>>> But a SELECT lexize('de_ispell', 'Autobahn')
>>>>
>>>> results in
>>>>
>>>> lexize
>>>> {autobahn}
>>>>
>>>> I would expect {auto,bahn,autobahn}
>>>
>>>
>>> Hmm, have you checked 'Autobahn' in the ispell dictionary? Does the dictionary
>>> you use support the 'Z' flag for compound words?
>>
>> Autobahn is in the ispell dictionary. What does an ispell dictionary need
>> in order to support the Z flag?
>>
>
> Try 'ispell -C Autobahn' and search for 'compound' in 'man ispell' for details.
> There is a problem only if ispell *does* split the word correctly while
> tsearch2 doesn't. You should find a correct ispell dictionary for German or
> create one yourself. You may consult monzilla.net
> http://staff.science.uva.nl/~christof/monzilla/research/project-dr.html
>
>
>>
>> Timo
>>
>>
>>
>>
>>
>>>
>>>
>>>>
>>>> The new configuration after the compound word patch:
>>>>
>>>
>>> Seems you overestimate my capabilities :)
>>>
>>>
>>>>
>>>> pg_ts_dict:
>>>>
>>>> dict_name:       simple
>>>> dict_init:       dex_init(text)
>>>> dict_initoption: NULL
>>>> dict_lexize:     dex_lexize(internal,internal,integer)
>>>> dict_comment:    Simple example of dictionary.
>>>>
>>>> dict_name:       en_stem
>>>> dict_init:       snb_en_init(text)
>>>> dict_initoption: /usr/local/pgsql/share/contrib/english.stop
>>>> dict_lexize:     snb_lexize(internal,internal,integer)
>>>> dict_comment:    English Stemmer. Snowball.
>>>>
>>>> dict_name:       ru_stem
>>>> dict_init:       snb_ru_init(text)
>>>> dict_initoption: /usr/local/pgsql/share/contrib/russian.stop
>>>> dict_lexize:     snb_lexize(internal,internal,integer)
>>>> dict_comment:    Russian Stemmer. Snowball.
>>>>
>>>> dict_name:       ispell_template
>>>> dict_init:       spell_init(text)
>>>> dict_initoption: NULL
>>>> dict_lexize:     spell_lexize(internal,internal,integer)
>>>> dict_comment:    ISpell interface. Must have .dict and .aff files
>>>>
>>>> dict_name:       synonym
>>>> dict_init:       syn_init(text)
>>>> dict_initoption: NULL
>>>> dict_lexize:     syn_lexize(internal,internal,integer)
>>>> dict_comment:    Example of synonym dictionary
>>>>
>>>> dict_name:       de_ispell
>>>> dict_init:       spell_init(text)
>>>> dict_initoption: DictFile="/usr/local/pgsql/share/contrib/dictonary/german_comb.dict",
>>>>                  AffFile="/usr/local/pgsql/share/contrib/dictonary/german_comb.aff",
>>>>                  StopFile="/usr/local/pgsql/share/contrib/dictonary/german.stop"
>>>> dict_lexize:     spell_lexize(internal,internal,integer)
>>>> dict_comment:    NULL
>>>>
>>>>
>>>>
>>>> Timo
>>>>
>>>>
>>>> Oleg Bartunov wrote:
>>>>
>>>>> Timo,
>>>>>
>>>>> please check that you applied the patch for compound word support.
>>>>> Which version of PostgreSQL is it?
>>>>> Does the ispell dictionary work for non-compound words?
>>>>>
>>>>> Oleg
>>>>>
>>>>> On Fri, 5 Nov 2004, Timo Haberkern wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> I have some trouble with my TSearch2 installation. I did the
>>>>>> installation as described in
>>>>>> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_compound_words
>>>>>> I used the German myspell dictionary from
>>>>>> http://lingucomponent.openoffice.org/spell_dic.html and converted it
>>>>>> with my2ispell.
>>>>>>
>>>>>> Nearly everything is working fine so far, except for two problems:
>>>>>>
>>>>>> 1.) The stop word file seems to be ignored: if I try
>>>>>> SELECT to_tsvector('default_german', 'ein Haus') I get 'ein':1 'haus':2
>>>>>>
>>>>>> 'ein' should be a stop word for German (and it is defined in the
>>>>>> german.stop file as well).
>>>>>>
>>>>>> 2.) The compound words feature doesn't work either. I have tried a lot
>>>>>> of words, e.g. 'Fehlermeldung': with
>>>>>> SELECT to_tsvector('default_german', 'Fehlermeldung')
>>>>>> I only get 'fehlermeldung':1, but I would expect 'fehler' and 'meldung'
>>>>>> as separate entries. Is there anything wrong with the dictionary or my
>>>>>> configuration?
>>>>>>
>>>>>>
>>>>>> My current configuration:
>>>>>>
>>>>>> pg_ts_cfg:
>>>>>>
>>>>>> default default C
>>>>>> default_russian default ru_RU.KOI8-R
>>>>>> simple default NULL
>>>>>> default_german default de_DE.ISO8859-1
>>>>>>
>>>>>> pg_ts_cfgmap:
>>>>>>
>>>>>> default_german host {simple}
>>>>>> default_german hword {simple}
>>>>>> default_german int {simple}
>>>>>> default_german nlhword {simple}
>>>>>> default_german nlpart_hword {simple}
>>>>>> default_german nlword {simple}
>>>>>> default_german part_hword {simple}
>>>>>> default_german sfloat {simple}
>>>>>> default_german uint {simple}
>>>>>> default_german uri {simple}
>>>>>> default_german url {simple}
>>>>>> default_german version {simple}
>>>>>> default_german word {simple}
>>>>>> default_german lpart_hword {de_ispell,german_snowball}
>>>>>> default_german lword {de_ispell,german_snowball}
>>>>>> default_german lhword {de_ispell,german_snowball}
>>>>>>
>>>>>>
>>>>>> pg_ts_dict:
>>>>>>
>>>>>> de_ispell | 17166 |
>>>>>> DictFile="/usr/local/pgsql/share/contrib/dictonary/german.dict",
>>>>>> AffFile="/usr/local/pgsql/share/contrib/dictonary/german.aff",
>>>>>> StopFile="/usr/local/pgsql/share/contrib/dictonary/german.stop" |
>>>>>> 17167 | NULL
>>>>>> german_snowball | 17357 | NULL | 17162 | Snowball stemmer for German
>>>>>>
>>>>>>
>>>>>>
>>>>>> Can anyone help me?
>>>>>>
>>>>>> regards
>>>>>>
>>>>>> Timo
>>>>>>
>>>>>>
>>>>>> ---------------------------(end of
>>>>>> broadcast)---------------------------
>>>>>> TIP 4: Don't 'kill -9' the postmaster
>>>>>>
>>>>>
>>>>> Regards,
>>>>> Oleg
>>>>> _____________________________________________________________
>>>>> Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
>>>>> Sternberg Astronomical Institute, Moscow University (Russia)
>>>>> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
>>>>> phone: +007(095)939-16-83, +007(095)939-23-83
>>>>>
>>>>> ---------------------------(end of broadcast)---------------------------
>>>>> TIP 2: you can get off all lists at once with the unregister command
>>>>> (send "unregister YourEmailAddressHere" to majordomo(at)postgresql(dot)org)
>>>>>
>>>>>
>>>>
>>>
>>> Regards,
>>> Oleg
>>>
>>>
>>
>
> Regards,
> Oleg
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
