Re: Adding Arabic dictionary for TSearch2.. to_tsvector('arabic'...) doesn't work..

From: Mohamed <mohamed5432154321(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Adding Arabic dictionary for TSearch2.. to_tsvector('arabic'...) doesn't work..
Date: 2009-01-09 15:30:56
Message-ID: 861fed220901090730g3a4125d7xfe59484af0a7ed98@mail.gmail.com
Lists: pgsql-general

Thank you for your detailed answer. I have learned a lot more about this stuff
now :)
As I see it, according to the results it's between Hunspell and Aspell. My
Aspell version is 0.6, released in 2006. The Hunspell one was released in 2008.

When I run the psql command \dFt I get the following list of templates:

- ispell
- simple
- snowball
- synonym
- thesaurus

So I set up my dictionary with ispell as the template and the Hunspell or
Aspell files. Now I just have one decision to make :)
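
For reference, a minimal sketch of that setup as I understand it. It assumes
the chosen files have been converted to UTF-8 and copied into
$SHAREDIR/tsearch_data as ar.dict, ar.affix and ar.stop (the ispell template
looks for those extensions, so ar.dic and ar.aff would need renaming). The
names arabic_dic and arabic_cfg are just my own picks, and I have not run this
against the real files yet:

```sql
-- Dictionary built on the ispell template; file names are given without extension.
CREATE TEXT SEARCH DICTIONARY arabic_dic (
    TEMPLATE  = ispell,
    DictFile  = ar,   -- tsearch_data/ar.dict
    AffFile   = ar,   -- tsearch_data/ar.affix
    StopWords = ar    -- tsearch_data/ar.stop
);

-- Hypothetical configuration name; copied from "simple" so the parser mappings exist.
CREATE TEXT SEARCH CONFIGURATION arabic_cfg (COPY = simple);

-- Non-ASCII words go to the ispell dictionary first; anything it does not
-- recognise falls through to "simple".
ALTER TEXT SEARCH CONFIGURATION arabic_cfg
    ALTER MAPPING FOR word, hword, hword_part
    WITH arabic_dic, simple;

-- Quick checks; replace the placeholders with real Arabic text.
SELECT ts_lexize('arabic_dic', 'some_arabic_word');
SELECT to_tsvector('arabic_cfg', 'some arabic text');
```

If ts_lexize returns NULL for a word, the dictionary did not recognise it and
the next dictionary in the mapping gets a try.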

Just another thing:

> If you want to support multiple language dictionaries for a single table,
> with each row associated to its own dictionary
>

Not really. Since the two languages don't overlap, couldn't I set up two
separate dictionaries and index against both on the whole table? I think
that's what Oleg was referring to. Not sure...
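
Roughly what I have in mind is the sketch below. The table docs, its columns
id and body, and the configuration name arabic_cfg are made up for
illustration, and this is untested:

```sql
-- One expression index covering both configurations: the two tsvectors
-- are concatenated into a single indexed value.
CREATE INDEX docs_fts_idx ON docs USING gin (
    (to_tsvector('english', body) || to_tsvector('arabic_cfg', body))
);

-- The WHERE clause repeats the indexed expression so the index can be used.
-- The two tsquery values are OR-ed together with ||.
SELECT id
FROM docs
WHERE (to_tsvector('english', body) || to_tsvector('arabic_cfg', body))
      @@ (plainto_tsquery('english', 'search words')
          || plainto_tsquery('arabic_cfg', 'search words'));
```

That way every row is indexed under both languages and no per-row language
column is needed.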

Thanks for all the help / Moe

PS. I can't read Arabic, so I can't look at the files to decide :O

On Fri, Jan 9, 2009 at 2:14 PM, Andrew <archa(at)pacific(dot)net(dot)au> wrote:

> Hi Mohammed,
>
> See my answers below, and hopefully they won't lead you too far astray.
> Note though, it has been a long time since I have done this and there are
> doubtless more knowledgeable people in this forum who will be able to
> correct anything I say that may be misleading or incorrect.
>
> Cheers,
>
> Andy
>
> Mohamed wrote:
>
> no one ?
>
> / Moe
>
>
> On Thu, Jan 8, 2009 at 11:46 AM, Mohamed <mohamed5432154321(at)gmail(dot)com> wrote:
>
>> Ok, thank you all for your help. It has been very valuable. I am starting
>> to get the hang of it and have read almost all of chapter 12 plus extras,
>> but I still need a little bit of guidance.
>>
>> I now have these files:
>>
>> - An Arabic Hunspell rar file (OpenOffice version), which includes:
>>   - ar.dic
>>   - ar.aff
>> - An Aspell rar file that includes a lot of files
>> - A MySpell file (described as a simple word list)
>> - And also Andrew's two files:
>>   - ar.affix
>>   - ar.stop
>>
>> I am thinking that I should go with just one of these, right, and that
>> should be the Hunspell?
>>
> Hunspell is based on MySpell, extending it with support for complex
> compound words and Unicode characters; however, PostgreSQL cannot take
> advantage of Hunspell's compound word capabilities at present. Aspell is a
> GNU spell checker that replaces Ispell and supports UTF-8 characters. See
> http://aspell.net/test/ for comparisons between dictionaries, though be
> aware this test is hosted by Aspell... I will leave it to others to argue
> the merits of Hunspell vs. Aspell, and why you would choose one or the
> other.
>
>> There is an ar.aff file there and Andrew's file ends with .affix; are
>> those perhaps similar? Should I skip Andrew's?
>>
> The ar.aff file that comes with OpenOffice Hunspell dictionary is
> essentially the same as the ar.affix I supplied. Just open the two up,
> compare them and choose the one that you feel is best. A Hunspell
> dictionary will work better with a corresponding affix file.
>
>> Use just the ar.stop file?
>>
> The ar.stop file lists common words that should be excluded from the
> index. You will want a stop file as well as the dictionary and affix file.
> Feel free to modify the stop file to meet your own needs.
>
>
>> As for the per-row Arabic/English language search approach, I will
>> skip it and choose the approach suggested by Oleg:
>>
>>> if arabic and english characters are not overlapped, you can use one
>>> index.
>>>
>>
>> The Arabic letters and English letters or words don't overlap so that
>> should not be an issue? Will I be able to index and search against both
>> languages in the same query?
>>
> If you want to support multiple language dictionaries for a single
> table, with each row associated to its own dictionary, use the
> tsvector_update_trigger_column trigger to automatically update your tsvector
> indexed column on insert or update. To support this, your table will need
> an additional column of type regconfig that contains the name of the
> text search configuration (and through it the dictionary) to use when
> building the tsvector for that particular row. See
> http://www.postgresql.org/docs/current/static/textsearch-features.html#TEXTSEARCH-UPDATE-TRIGGERS
> for more details. This will allow you to search across both languages in
> the one query as you were asking.
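>
> A minimal sketch of that trigger setup, in case it helps. The table docs,
> its columns and the trigger name are hypothetical, and I have not tested
> this against your data:
>
> ```sql
> -- Each row names its own text search configuration in the "config" column.
> CREATE TABLE docs (
>     id     serial PRIMARY KEY,
>     body   text,
>     config regconfig NOT NULL,  -- e.g. 'english' or your Arabic configuration
>     tsv    tsvector
> );
>
> -- Arguments: tsvector column, configuration column, then the text column(s).
> CREATE TRIGGER docs_tsv_update
>     BEFORE INSERT OR UPDATE ON docs
>     FOR EACH ROW
>     EXECUTE PROCEDURE tsvector_update_trigger_column(tsv, config, body);
>
> -- A GIN index on the maintained column serves queries for both languages.
> CREATE INDEX docs_tsv_idx ON docs USING gin (tsv);
> ```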
>
>
>> And also
>>
>> 1. What language files should I use?
>> 2. What does my CREATE TEXT SEARCH DICTIONARY statement for the Arabic
>> language look like? Perhaps like this:
>>
>> CREATE TEXT SEARCH DICTIONARY arabic_dic (
>>     TEMPLATE = ? ,   -- not sure what this means
>>     DictFile = ar,   -- referring to ar.dic (Hunspell)
>>     AffFile = ar,    -- referring to ar.aff (Hunspell)
>>     StopWords = ar   -- referring to Andrew's stop file (what about Andrew's .affix file?)
>>
>>     -- Anything more?
>> );
>>
>>
> From the psql command line you can find out what templates you have using
> the following command:
>
> \dFt
>
> or looking at the contents of the pg_ts_template table.
>
> If choosing a Hunspell or Aspell dictionary, I believe a value of TEMPLATE
> = ispell should be okay for you; see
> http://www.postgresql.org/docs/current/static/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY
> The template tells PostgreSQL how to interact with the dictionary. The rest
> of the create dictionary statement appears fine to me.
>
>> Thanks again! / Moe
>>
>>
