Re: Adding Arabic dictionary for TSearch2.. to_tsvector('arabic'...) doesn't work..

From: Andrew <archa(at)pacific(dot)net(dot)au>
To: Mohamed <mohamed5432154321(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Adding Arabic dictionary for TSearch2.. to_tsvector('arabic'...) doesn't work..
Date: 2009-01-09 13:14:42
Message-ID: 49674DC2.7060104@pacific.net.au
Lists: pgsql-general

Hi Mohamed,

See my answers below, and hopefully they won't lead you too far astray.
Note though, it has been a long time since I have done this and there
are doubtless more knowledgeable people in this forum who will be able
to correct anything I say that may be misleading or incorrect.

Cheers,

Andy

Mohamed wrote:
> no one ?
>
> / Moe
>
>
> On Thu, Jan 8, 2009 at 11:46 AM, Mohamed <mohamed5432154321(at)gmail(dot)com
> <mailto:mohamed5432154321(at)gmail(dot)com>> wrote:
>
> Ok, thank you all for your help. It has been very valuable. I am
> starting to get the hang of it and almost read the whole chapter
> 12 + extras but I still need a little bit of guidance.
>
> I have now these files :
>
> * An Arabic Hunspell rar file (OpenOffice version), which includes:
> o ar.dic
> o ar.aff
> * An Aspell rar file that includes a lot of files
> * A Myspell file (says simple words list)
> * And also Andrew's two files:
> o ar.affix
> o ar.stop
>
> I am thinking that I should go with just one of these right and
> that should be the Hunspell?
>
Hunspell is based on MySpell and extends it with support for complex
compound words and Unicode characters; however, PostgreSQL cannot take
advantage of Hunspell's compound-word capabilities at present. Aspell
is a GNU spell checker that replaces Ispell and supports UTF-8
characters. See http://aspell.net/test/ for comparisons between
dictionaries, though be aware this test is hosted by Aspell... I will
leave it to others to argue the merits of Hunspell vs. Aspell, and why
you would choose one or the other.

> There is an ar.aff file there and Andrew's file ends with .affix,
> are those perhaps similar? Should I skip Andrew's?
>
The ar.aff file that comes with OpenOffice Hunspell dictionary is
essentially the same as the ar.affix I supplied. Just open the two up,
compare them and choose the one that you feel is best. A Hunspell
dictionary will work better with a corresponding affix file.
>
> Use just the ar.stop file ?
>
The ar.stop file lists common words that should be excluded from
indexing. You will want a stop file as well as the dictionary and affix
file. Feel free to modify the stop file to meet your own needs.
>
>
> On the Arabic / English on row basis language search approach, I
> will skip and choose the approach suggested by Oleg :
>
> if Arabic and English characters do not overlap, you can
> use one index.
>
>
> The Arabic letters and English letters or words don't overlap so
> that should not be an issue? Will I be able to index and search
> against both languages in the same query?
>
If you want to support multiple languages in a single table, with each
row associated with its own text search configuration, use the
tsvector_update_trigger_column trigger to automatically update your
tsvector column on insert or update. To support this, your table will
need an additional column of type regconfig that holds the name of the
configuration to use for that particular row. See
http://www.postgresql.org/docs/current/static/textsearch-features.html#TEXTSEARCH-UPDATE-TRIGGERS
for more details. This will allow you to search across both languages
in the one query as you were asking.
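
A minimal sketch of that setup follows. The table name (messages), its
column names, the trigger and index names, and the Arabic configuration
name (arabic_ispell) are all hypothetical; arabic_ispell is assumed to
exist already, while english ships with PostgreSQL.

```sql
-- Hypothetical table: each row records which configuration parses it.
CREATE TABLE messages (
    id     serial PRIMARY KEY,
    body   text,
    config regconfig NOT NULL,   -- e.g. 'english' or 'arabic_ispell'
    tsv    tsvector              -- maintained by the trigger below
);

-- Arguments: tsvector column, regconfig column, then the text column(s).
CREATE TRIGGER messages_tsv_update
    BEFORE INSERT OR UPDATE ON messages
    FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger_column(tsv, config, body);

-- One index serves rows in both languages.
CREATE INDEX messages_tsv_idx ON messages USING gin (tsv);

INSERT INTO messages (body, config)
    VALUES ('The quick brown foxes', 'english');
-- Assumes the hypothetical arabic_ispell configuration has been created.
INSERT INTO messages (body, config)
    VALUES ('مرحبا بالعالم', 'arabic_ispell');

-- Parse the query with each row's own configuration so that it is
-- normalised the same way as the indexed text.
SELECT id, body
  FROM messages
 WHERE tsv @@ to_tsquery(config, 'fox');
```

Passing the config column to to_tsquery is one option; you could
instead pass a fixed configuration name if you know which language the
search terms are in.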
>
>
> And also
>
> 1. What language files should I use ?
> 2. How does my create dictionary for the arabic language look
> like ? Perhaps like this :
>
> CREATE TEXT SEARCH DICTIONARY arabic_dic(
> TEMPLATE = ? , // Not sure what this means
> DictFile = ar, // referring to ar.dic (hunspell)
> AffFile = ar , // referring to ar.aff (hunspell)
> StopWords = ar // referring to Andrews stop file. ( what about Andrews .affix file ? )
>
> // Anything more ?
> );
>

From the psql command line you can find out which templates you have
using the following command:

\dFt

or by looking at the contents of the pg_ts_template system catalog.
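
For the catalog route, a query along these lines should list the same
template names (only the tmplname column is selected here):

```sql
-- List the text search templates known to this database.
SELECT tmplname
  FROM pg_ts_template
 ORDER BY tmplname;
```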

If choosing a Hunspell or Aspell dictionary, I believe a value of
TEMPLATE = ispell should be okay for you - see
http://www.postgresql.org/docs/current/static/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY.
The template tells PostgreSQL how to interact with the dictionary. The
rest of the create dictionary statement appears fine to me, except that
the // comments are not valid SQL; use -- instead.
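
As a rough sketch only, the whole thing might look like the following.
It assumes the three files have been converted to UTF-8 and copied into
the tsearch_data directory under your PostgreSQL share directory as
ar.dict, ar.affix and ar.stop, since those are the extensions
PostgreSQL looks for (so ar.dic and ar.aff would need renaming). The
configuration name arabic_ispell is hypothetical. Note also that
to_tsvector takes the name of a text search configuration, not of a
dictionary, so the dictionary has to be mapped into a configuration
before it can be used there.

```sql
-- Dictionary built on the ispell template.
CREATE TEXT SEARCH DICTIONARY arabic_dic (
    TEMPLATE  = ispell,
    DictFile  = ar,    -- resolves to ar.dict
    AffFile   = ar,    -- resolves to ar.affix
    StopWords = ar     -- resolves to ar.stop
);

-- Hypothetical configuration name; starts as a copy of "simple".
CREATE TEXT SEARCH CONFIGURATION arabic_ispell (COPY = pg_catalog.simple);

-- Route non-ASCII word tokens through the Arabic dictionary first and
-- fall back to "simple" for words it does not recognise.
ALTER TEXT SEARCH CONFIGURATION arabic_ispell
    ALTER MAPPING FOR word, hword, hword_part
    WITH arabic_dic, simple;

-- Check the dictionary on its own, then the full configuration.
SELECT ts_lexize('arabic_dic', 'مرحبا');
SELECT to_tsvector('arabic_ispell', 'مرحبا بالعالم');
```

ASCII tokens (asciiword and friends) keep the mapping copied from
simple, so English words in the same column are still indexed, just
without stemming.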

> Thanks again! / Moe
>
>
