Re: Adding Arabic dictionary for TSearch2.. to_tsvector('arabic'...) doesn't work..

From: Mohamed <mohamed5432154321(at)gmail(dot)com>
To: Andrew <archa(at)pacific(dot)net(dot)au>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Adding Arabic dictionary for TSearch2.. to_tsvector('arabic'...) doesn't work..
Date: 2009-02-03 18:50:46
Message-ID: 861fed220902031050o11d720c6g70e289010a4476d1@mail.gmail.com
Lists: pgsql-general

I finally got around to building a configuration, but the results are not
good at all and a bit odd.

Here is what I did:

I built the configuration with the hunspell dictionary plus a simple Arabic
dictionary (with just the stop words as input), because I noticed that
words that are not recognized still get returned.

I removed this line from the affix file:
Flag long

CREATE TEXT SEARCH DICTIONARY hunar (
TEMPLATE = ispell,
DictFile = hunar,
AffFile = hunar,
StopWords = ar
);

CREATE TEXT SEARCH DICTIONARY ar_simple (
TEMPLATE = pg_catalog.simple, -- Not sure what this is or does
STOPWORDS = ar
);

CREATE TEXT SEARCH CONFIGURATION hunarconfig ( COPY = pg_catalog.english );

ALTER TEXT SEARCH CONFIGURATION hunarconfig
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH hunar, ar_simple;
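
To double-check that both dictionaries were stored with the options I intended, something like the following might help. It is only a sketch: it assumes the names hunar, ar_simple and hunarconfig used above and reads the system catalog pg_ts_dict.

```sql
-- Show the stored initialization options of the two dictionaries defined above.
SELECT dictname, dictinitoption
FROM pg_catalog.pg_ts_dict
WHERE dictname IN ('hunar', 'ar_simple');

-- In psql, the token-type-to-dictionary mapping of the configuration
-- can be listed with the meta-command:  \dF+ hunarconfig
```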

Running :

SELECT * FROM ts_debug('hunarconfig', '
وفي هذا الإطار أجرى رئيس الوزراء القطري الشيخ حمد بن جاسم بن جبر آل ثاني
محادثات في لندن مع نظيره البريطاني غوردون براون تناولت الأوضاع الأمنية في
الشرق الأوسط وتطرقت المباحثات إلى سبل تثبيت وقف إطلاق النار في قطاع غزة
وعملية إعادة إعمار وبناء القطاع بعد الحرب الإسرائيلية الأخيرة.
');

returned odd results (I think). Not many words were recognized by the hunar
dictionary, and some stop words were recognized by the latter dictionary,
ar_simple, even though the same stop words file is used in the hunar
dictionary. Should I not expect the stop words to be recognized by hunar
rather than ar_simple?
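
To keep the output manageable, the same check can be run on just a few words, selecting only the columns of interest. This is a sketch that assumes the hunarconfig configuration above; the column names are the ones ts_debug returns.

```sql
-- Show, for each non-blank token, the dictionaries that were consulted,
-- the dictionary that finally handled the token, and the resulting lexemes.
SELECT token, dictionaries, dictionary, lexemes
FROM ts_debug('hunarconfig', 'وفي هذا الإطار أجرى')
WHERE alias <> 'blank';
```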

Here is a small sample that shows what I mean (with comments):

"وفي"; "{hunar,ar_simple}"; "hunar"; "{}"
// Recognized as a stop word by the hunar dictionary

"هذا"; "{hunar,ar_simple}"; "ar_simple"; "{}"
// Recognized as a stop word, but by ar_simple? Why?

"أجرى"; "{hunar,ar_simple}"; "ar_simple"; "{أجرى}"
// Not recognized by either dictionary, returned unchanged

Is this not strange? Shouldn't the first dictionary (hunar) return the
stopwords recognized and not ar_simple?
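
One way to see which dictionary is responsible might be to test each one directly with ts_lexize, bypassing the configuration. As far as I understand the documentation, ts_lexize returns NULL when the dictionary does not know the word (so the next dictionary in the list is tried), an empty array when the word is known but is a stop word, and an array of lexemes otherwise. A sketch, assuming the hunar and ar_simple dictionaries defined above:

```sql
-- Ask each dictionary separately about the three sample words.
-- NULL        = word unknown to this dictionary (passed on to the next one)
-- empty array = word recognized, but discarded as a stop word
SELECT w.word,
       ts_lexize('hunar', w.word)     AS hunar_result,
       ts_lexize('ar_simple', w.word) AS ar_simple_result
FROM (VALUES ('وفي'::text), ('هذا'::text), ('أجرى'::text)) AS w(word);
```

If hunar returns NULL for هذا, that would suggest the word is simply not in the hunspell word list, so it falls through to ar_simple, which then drops it as a stop word.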

/ Moe

On Sat, Jan 10, 2009 at 11:14 AM, Andrew <archa(at)pacific(dot)net(dot)au> wrote:

> Mohamed wrote:
>
> Thank you for your detailed answer. I have learned a lot more about this
> stuff now :)
>
> You're welcome :-)
>
>
> As I see it, according to the results it's between Hunspell and Aspell.
> My Aspell version is 0.6 released 2006. The Hunspell was released in 2008.
>
> When I run the Postgres command \dFt I get the following list :
>
> - ispell
> - simple
> - snowball
> - synonym
> - thesaurus
>
>
> So I set up my dictionary with the ispell as a template and
> Hunspell/Aspell files. Now I just have one decision to make :)
>
> Just another thing:
>
>> If you want to support multiple language dictionaries for a single table,
>> with each row associated to its own dictionary
>>
>
> Not really, since the two languages don't overlap, couldn't I set up two
> separate dictionaries and index against both on the whole table ? I think
> that's what Oleg was referring to. Not sure...
>
> Neither am I, so when in doubt, try it out. And let us know the results.
>
>
> Thanks for all the help / Moe
>
> PS. I can't read Arabic so I can't have a look at the files to decide :O
>
> In which case, assuming you do not have access to a friend who is able
> to read Arabic, either choose the file with the most entries (making
> assumption that more is better) or take the one that came with the
> dictionary (assuming that those two will be best matched) or if you still
> can't decide, flip a coin. As you can't read Arabic, it is not as if you
> are in a position to put both files through their paces and test them
> against a word list, picking the one that gives you the best results for the
> type of words your text is likely to contain.
>
> Cheers,
>
> Andy
>
>
