Re: Fulltext search configuration

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Mohamed <mohamed5432154321(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Fulltext search configuration
Date: 2009-02-02 15:34:22
Message-ID: Pine.LNX.4.64.0902021829280.4158@sn.sai.msu.ru
Lists: pgsql-general

On Mon, 2 Feb 2009, Oleg Bartunov wrote:

> On Mon, 2 Feb 2009, Mohamed wrote:
>
>> Hehe, ok..
>> I don't know either but I took some lines from Al-Jazeera :
>> http://aljazeera.net/portal
>>
>> I just made the change you suggested, created the dictionary successfully,
>> and tried this:
>>
>> select ts_lexize('ayaspell', '?????? ??????? ????? ????? ?? ???? ?????????
>> ?????')
>>
>> but I got nothing... :(
>
> Mohamed, what did you expect from ts_lexize? Please provide us with useful
> information; otherwise we can't help you.
>
>>
>> Is there a way of making sure that words that are not recognized also get
>> indexed/searched for? (Not that I think this is the problem)
>
> yes
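As for what to expect from ts_lexize: it returns an array of lexemes when the
dictionary recognizes the input, an empty array for a stop word, and NULL when
the dictionary does not know the word. Note also that ts_lexize expects a
single token, not whole text, so passing a full sentence to an ispell-style
dictionary will give NULL even if the dictionary works; use to_tsvector with a
configuration to process whole phrases. A minimal check (assuming a stock 8.3
install with the built-in english_stem dictionary):

select ts_lexize('english_stem', 'banking');   -- {bank}  (recognized, stemmed)
select ts_lexize('english_stem', 'a');         -- {}      (stop word)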

Read http://www.postgresql.org/docs/8.3/static/textsearch-dictionaries.html
"A text search configuration binds a parser together with a set of
dictionaries to process the parser's output tokens. For each token type that
the parser can return, a separate list of dictionaries is specified by the
configuration. When a token of that type is found by the parser, each
dictionary in the list is consulted in turn, until some dictionary recognizes
it as a known word. If it is identified as a stop word, or if no dictionary
recognizes the token, it will be discarded and not indexed or searched for.
The general rule for configuring a list of dictionaries is to place first
the most narrow, most specific dictionary, then the more general dictionaries,
finishing with a very general dictionary, like a Snowball stemmer or simple,
which recognizes everything."

quick example:

CREATE TEXT SEARCH CONFIGURATION arabic (
COPY = english
);

=# \dF+ arabic
Text search configuration "public.arabic"
Parser: "pg_catalog.default"
Token | Dictionaries
-----------------+--------------
asciihword | english_stem
asciiword | english_stem
email | simple
file | simple
float | simple
host | simple
hword | english_stem
hword_asciipart | english_stem
hword_numpart | simple
hword_part | english_stem
int | simple
numhword | simple
numword | simple
sfloat | simple
uint | simple
url | simple
url_path | simple
version | simple
word | english_stem

Then you can alter this configuration.
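For example (a sketch only - assuming your Arabic ispell dictionary is named
ayaspell, as in your ts_lexize call above), map the plain word token types to
ayaspell first and fall back to simple, so anything ayaspell does not
recognize is still indexed instead of being discarded:

ALTER TEXT SEARCH CONFIGURATION arabic
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH ayaspell, simple;

Indexing and searching then go through the same configuration, e.g. (table and
column names here are only illustrative):

CREATE INDEX messages_fts_idx ON messages
    USING gin(to_tsvector('arabic', body));

SELECT * FROM messages
 WHERE to_tsvector('arabic', body) @@ to_tsquery('arabic', 'searchterm');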

>
>
>>
>> / Moe
>>
>>
>>
>> On Mon, Feb 2, 2009 at 3:50 PM, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:
>>
>>> Mohamed,
>>>
>>> Comment out the FLAG line in ar.affix so that it reads
>>> #FLAG long
>>> and creation of the ispell dictionary will work. This is a temporary
>>> solution; Teodor is working on fixing the affix auto-recognition.
>>>
>>> I can't say anything about testing, since somebody should provide a
>>> first test case. I don't know how to type Arabic :)
>>>
>>>
>>> Oleg
>>>
>>> On Mon, 2 Feb 2009, Mohamed wrote:
>>>
>>>> Oleg, like I mentioned earlier, I have a different .affix file that I got
>>>> from Andrew along with the stop file, and I get no errors creating the
>>>> dictionary using that one, but I get nothing out of ts_lexize.
>>>> The size of that one is 406,219 bytes,
>>>> and the size of the hunspell one (the first) is 406,229 bytes.
>>>>
>>>> A little too close, don't you think?
>>>>
>>>> It might be that the Arabic hunspell (ayaspell) affix file is damaged on
>>>> some lines and that I got the fixed one from Andrew.
>>>>
>>>> Just wanted to let you know.
>>>>
>>>> / Moe
>>>>
>>>>
>>>>
>>>> On Mon, Feb 2, 2009 at 3:25 PM, Mohamed <mohamed5432154321(at)gmail(dot)com>
>>>> wrote:
>>>>
>>>>> Ok, thank you Oleg.
>>>>> I have another dictionary package which is a conversion to hunspell as
>>>>> well:
>>>>>
>>>>>
>>>>>
>>>>> http://wiki.services.openoffice.org/wiki/Dictionaries#Arabic_.28North_Africa_and_Middle_East.29
>>>>> (Conversion of Buckwalter's Arabic morphological analyser) 2006-02-08
>>>>>
>>>>> And running that gives me this error (again from the affix file):
>>>>>
>>>>> ERROR: wrong affix file format for flag
>>>>> CONTEXT: line 560 of configuration file "C:/Program
>>>>> Files/PostgreSQL/8.3/share/tsearch_data/arabic_utf8_alias.affix": "PFX
>>>>> 1013
>>>>> Y 6
>>>>> "
>>>>>
>>>>> / Moe
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Feb 2, 2009 at 2:41 PM, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:
>>>>>
>>>>>> Mohamed,
>>>>>>
>>>>>> We are looking into the problem.
>>>>>>
>>>>>> Oleg
>>>>>>
>>>>>> On Mon, 2 Feb 2009, Mohamed wrote:
>>>>>>
>>>>>>> No, I don't. But ts_lexize doesn't return anything, so I figured there
>>>>>>> must be an error somehow.
>>>>>>> I think we are using the same dictionary, plus I am using the stopwords
>>>>>>> file and a different affix file, because using the hunspell (ayaspell)
>>>>>>> .aff gives me this error:
>>>>>>>
>>>>>>> ERROR: wrong affix file format for flag
>>>>>>> CONTEXT: line 42 of configuration file "C:/Program
>>>>>>> Files/PostgreSQL/8.3/share/tsearch_data/hunarabic.affix": "PFX Aa Y 40
>>>>>>>
>>>>>>> / Moe
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 2, 2009 at 12:13 PM, Daniel Chiaramello <
>>>>>>> daniel(dot)chiaramello(at)golog(dot)net> wrote:
>>>>>>>
>>>>>>>> Hi Mohamed.
>>>>>>>>
>>>>>>>> I don't know where you got the dictionary - I unsuccessfully tried the
>>>>>>>> OpenOffice one (the Ayaspell one) myself, and I had no Arabic
>>>>>>>> stopwords file.
>>>>>>>>
>>>>>>>> Renaming the file is supposed to be enough (I did it successfully for
>>>>>>>> the Thai dictionary) - the ".aff" file becoming the ".affix" one.
>>>>>>>> When I tried to create the dictionary:
>>>>>>>>
>>>>>>>> CREATE TEXT SEARCH DICTIONARY ar_ispell (
>>>>>>>> TEMPLATE = ispell,
>>>>>>>> DictFile = ar_utf8,
>>>>>>>> AffFile = ar_utf8,
>>>>>>>> StopWords = english
>>>>>>>> );
>>>>>>>>
>>>>>>>> I had an error:
>>>>>>>>
>>>>>>>> ERREUR: mauvais format de fichier affixe pour le drapeau
>>>>>>>> CONTEXTE : ligne 42 du fichier de configuration «
>>>>>>>> /usr/share/pgsql/tsearch_data/ar_utf8.affix » : « PFX Aa Y
>>>>>>>> 40
>>>>>>>>
>>>>>>>> (which means "wrong affix file format for flag", at line 42 of the
>>>>>>>> configuration file)
>>>>>>>>
>>>>>>>> Do you have an error when creating your dictionary?
>>>>>>>>
>>>>>>>> Daniel
>>>>>>>>
>>>>>>>> Mohamed wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> I have run into some problems here.
>>>>>>>> I am trying to implement Arabic full-text search on three columns.
>>>>>>>>
>>>>>>>> To create a dictionary I have a hunspell dictionary and an Arabic stop
>>>>>>>> file.
>>>>>>>>
>>>>>>>> CREATE TEXT SEARCH DICTIONARY hunspell_dic (
>>>>>>>> TEMPLATE = ispell,
>>>>>>>> DictFile = hunarabic,
>>>>>>>> AffFile = hunarabic,
>>>>>>>> StopWords = arabic
>>>>>>>> );
>>>>>>>>
>>>>>>>>
>>>>>>>> 1) The problem is that the hunspell package contains a .dic and a .aff
>>>>>>>> file, but the configuration requires a .dict and a .affix file. I have
>>>>>>>> tried to change the extensions, but with no success.
>>>>>>>>
>>>>>>>> 2) ts_lexize('hunspell_dic', 'ARABIC WORD') returns nothing
>>>>>>>>
>>>>>>>> 3) How can I convert my .dic and .aff files to valid .dict and .affix
>>>>>>>> files?
>>>>>>>>
>>>>>>>> 4) I have read that when using dictionaries, if a word is not
>>>>>>>> recognized by any dictionary it will not be indexed. I find that
>>>>>>>> troublesome. I would like everything but the stop words to be indexed.
>>>>>>>> I guess this might be a step that I am not ready for yet, but just
>>>>>>>> wanted to put it out there.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Also, I would like to know what the full-text search implementation
>>>>>>>> process looks like, from configuration to search.
>>>>>>>>
>>>>>>>> Create a dictionary, then a text search configuration, add the
>>>>>>>> dictionary to the configuration, index the columns with GIN or GiST ...
>>>>>>>>
>>>>>>>> What does a search look like? Does it match against the GIN/GiST index?
>>>>>>>> Has that index been built using the dictionary/configuration, or is the
>>>>>>>> dictionary only used on search phrases?
>>>>>>>>
>>>>>>>> / Moe
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
