
Adding Arabic dictionary for TSearch2.. to_tsvector('arabic'...) doesn't work..

From: Mohamed <mohamed5432154321(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Adding Arabic dictionary for TSearch2.. to_tsvector('arabic'...) doesn't work..
Date: 2009-01-09 15:44:17
Message-ID: 861fed220901090744q3cd47ca4m4f86609d8d1bd5ad@mail.gmail.com
Lists: pgsql-general
Thank you for your detailed answer. I have learned a lot more about this stuff
now :)
As I see it, according to the results it's between Hunspell and Aspell. My
Aspell version is 0.6, released in 2006. The Hunspell one was released in 2008.

When I run the psql command \dFt I get the following list of templates:

   - ispell
   - simple
   - snowball
   - synonym
   - thesaurus


So I set up my dictionary with the ispell as a template and Hunspell/Aspell
files. Now I just have one decision to make :)
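
A minimal sketch of that setup, under some assumptions: the dictionary files
have been converted to UTF-8 and copied into the server's
$SHAREDIR/tsearch_data directory as ar.dict, ar.affix and ar.stop (the ispell
template expects the .dict and .affix extensions, so ar.dic and ar.aff would
need renaming). The names arabic_hunspell and arabic_cfg are placeholders and
do not come from this thread.

```sql
-- Dictionary built on the ispell template.
-- DictFile, AffFile and StopWords are base names; PostgreSQL appends
-- .dict, .affix and .stop and looks under $SHAREDIR/tsearch_data.
CREATE TEXT SEARCH DICTIONARY arabic_hunspell (
    TEMPLATE  = ispell,
    DictFile  = ar,
    AffFile   = ar,
    StopWords = ar
);

-- Start from the built-in "simple" configuration so every token type
-- already has a mapping, then route non-ASCII words to the new dictionary.
CREATE TEXT SEARCH CONFIGURATION arabic_cfg (COPY = pg_catalog.simple);

-- Words the ispell dictionary does not recognise fall through to "simple".
ALTER TEXT SEARCH CONFIGURATION arabic_cfg
    ALTER MAPPING FOR word, hword, hword_part
    WITH arabic_hunspell, simple;

-- Quick check that to_tsvector accepts the new configuration
-- (replace the literal with real Arabic text).
SELECT to_tsvector('arabic_cfg', 'sample text');
```

Listing simple after the ispell dictionary matters because an ispell
dictionary returns nothing for words it does not know, and those words would
otherwise be dropped from the index.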

Just another thing:

> If you want to support multiple language dictionaries for a single table,
> with each row associated to its own dictionary
>

Not really. Since the two languages don't overlap, couldn't I set up two
separate dictionaries and index against both on the whole table? I think
that's what Oleg was referring to. Not sure...
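
A rough sketch of that idea, assuming a text search configuration named
arabic_cfg already exists and using a hypothetical documents table (neither
name comes from this thread):

```sql
-- Hypothetical table; only the column names matter for the sketch.
CREATE TABLE documents (
    id   serial PRIMARY KEY,
    body text NOT NULL,
    tsv  tsvector
);

-- Process every row under both configurations and concatenate the results,
-- so one column (and one index) covers both languages.
UPDATE documents
SET tsv = to_tsvector('arabic_cfg', body) || to_tsvector('english', body);

CREATE INDEX documents_tsv_idx ON documents USING gin (tsv);

-- Search the same column with either configuration's normalisation rules.
SELECT id
FROM documents
WHERE tsv @@ to_tsquery('english', 'dictionary')
   OR tsv @@ to_tsquery('arabic_cfg', 'dictionary');
```

The trade-off is that every row is processed twice, so the tsvector is larger
than with a per-row configuration, but no extra language column is needed.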

Thanks for all the help / Moe

P.S. I can't read Arabic, so I can't look at the files to decide :O




On Fri, Jan 9, 2009 at 2:14 PM, Andrew <archa(at)pacific(dot)net(dot)au> wrote:

>  Hi Mohammed,
>
> See my answers below, and hopefully they won't lead you too far astray.
> Note though, it has been a long time since I have done this and there are
> doubtless more knowledgeable people in this forum who will be able to
> correct anything I say that may be misleading or incorrect.
>
> Cheers,
>
> Andy
>
> Mohamed wrote:
>
> no one ?
>
>  / Moe
>
>
> On Thu, Jan 8, 2009 at 11:46 AM, Mohamed <mohamed5432154321(at)gmail(dot)com> wrote:
>
>> Ok, thank you all for your help. It has been very valuable. I am starting
>> to get the hang of it and have read almost the whole of chapter 12 + extras,
>> but I still need a little bit of guidance.
>>
>> I now have these files:
>>
>>    - An Arabic Hunspell rar file (OpenOffice version), which includes:
>>       - ar.dic
>>       - ar.aff
>>    - An Aspell rar file that includes a lot of files
>>    - A Myspell file (says simple words list)
>>    - And also Andrew's two files:
>>       - ar.affix
>>       - ar.stop
>>
>> I am thinking that I should go with just one of these, right, and that
>> should be the Hunspell one?
>>
> Hunspell is based on MySpell, extending it with support for complex
> compound words and Unicode characters; however, PostgreSQL cannot take
> advantage of Hunspell's compound word capabilities at present.  Aspell is a
> GNU dictionary that replaces Ispell and supports UTF-8 characters.  See
> http://aspell.net/test/ for comparisons between dictionaries, though be
> aware this test is hosted by Aspell...  I will leave it to others to argue
> the merits of Hunspell vs. Aspell, and why you would choose one or the
> other.
>
>> There is an ar.aff file there and Andrew's file ends with .affix; are
>> those perhaps similar? Should I skip Andrew's?
>>
> The ar.aff file that comes with the OpenOffice Hunspell dictionary is
> essentially the same as the ar.affix I supplied.  Just open the two up,
> compare them and choose the one that you feel is best.  A Hunspell
> dictionary will work better with a corresponding affix file.
>
>> Use just the ar.stop file?
>>
> The ar.stop file lists common words that should be excluded from indexing.
> You will want a stop file as well as the dictionary and affix file.  Feel
> free to modify the stop file to meet your own needs.
>
>
>> On the Arabic / English per-row language search approach, I will
>> skip it and choose the approach suggested by Oleg:
>>
>>> if arabic and english characters are not overlapped, you can use one
>>> index.
>>>
>>
>> The Arabic letters and English letters or words don't overlap, so that
>> should not be an issue? Will I be able to index and search against both
>> languages in the same query?
>>
> If you want to support multiple language dictionaries for a single
> table, with each row associated to its own dictionary, use the
> tsvector_update_trigger_column trigger to automatically update your tsvector
> indexed column on insert or update.  To support this, your table will need
> an additional column of type regconfig that contains the name of the
> text search configuration (and thereby the dictionary) to use for that
> particular row.  See
> http://www.postgresql.org/docs/current/static/textsearch-features.html#TEXTSEARCH-UPDATE-TRIGGERS
> for more details.  This will allow you to search across both languages in
> the one query as you were asking.
>
>
>>  And also
>>
>>    1. What language files should I use?
>>    2. What should my CREATE TEXT SEARCH DICTIONARY statement for Arabic
>>    look like? Perhaps like this:
>>
>>  CREATE TEXT SEARCH DICTIONARY arabic_dic(
>>     TEMPLATE = ? , -- Not sure what this means
>>     DictFile = ar, -- referring to ar.dic  (hunspell)
>>     AffFile = ar , -- referring to ar.aff  (hunspell)
>>     StopWords = ar -- referring to Andrew's stop file. ( what about Andrew's .affix file ? )
>>
>>     -- Anything more ?
>> );
>>
>>
> From the psql command line you can find out what templates you have using the
> following command:
>
> \dFt
>
> or looking at the contents of the pg_ts_template table.
>
> If choosing a Hunspell or Aspell dictionary, I believe a value of TEMPLATE
> = ispell should be okay for you - see
> http://www.postgresql.org/docs/current/static/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY.
> The template tells PostgreSQL how to interact with the dictionary.  The
> rest of the create dictionary statement appears fine to me.
>
>> Thanks again! / Moe
>>
>>
