Re: updating unaccent.rules for Arabic letters

From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: "kerbrose khaled" <kerbrose(at)hotmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: updating unaccent.rules for Arabic letters
Date: 2019-11-04 17:41:59
Message-ID: c2dfc689-4710-4a73-ad69-12807f36a289@manitou-mail.org
Lists: pgsql-hackers pgsql-translators

kerbrose khaled wrote:

> I would like to update unaccent.rules file to support Arabic letters. so
> could someone help me or tell me how could I add such contribution. I
> attached the file including the modifications, only the last 4 lines.

The Arabic letters are found in the Unicode block U+0600 to U+06FF
(https://www.fileformat.info/info/unicode/block/arabic/list.htm).
Until now the unaccent module has had no coverage of this block.
Since Arabic uses several diacritics [1], it would be best to
figure out all the transliterations that should go in and add them in
one go (plus coding that in the Python script).

The canonical way to unaccent is normally to apply a Unicode
transformation: NFC -> NFD and remove the non-spacing marks.
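
As a rough illustration of that transformation, here is a minimal sketch
using only Python's standard unicodedata module. The function name
strip_marks is made up for this example, and treating general category
"Mn" as the set of non-spacing marks is an assumption about what the
removal step should cover.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose to NFD, drop non-spacing marks, recompose to NFC.

    Hypothetical helper for illustration; "Mn" is the Unicode general
    category for non-spacing marks.
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", kept)
```

For instance, U+0622 (ARABIC LETTER ALEF WITH MADDA ABOVE) decomposes to
U+0627 followed by the combining mark U+0653, so only U+0627 remains
after the mark is dropped.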

I've tentatively done that with each code point in the 0600-06FF block
in SQL, using icu_transform from icu_ext [2], and it produces the
attached result, with 60 (!) entries, along with Unicode names for
readability.

Does that make sense to people who know Arabic?

For the record, here's the query:

WITH block(cp) AS (
  SELECT * FROM generate_series(x'600'::int, x'6ff'::int) AS cp
),
dest AS (
  SELECT cp,
         icu_transform(chr(cp),
           'any-NFD;[:nonspacing mark:] any-remove; any-NFC') AS unaccented
  FROM block
)
SELECT
  chr(cp) AS "src",
  icu_transform(chr(cp), 'Name') AS "srcName",
  dest.unaccented AS "dest",
  icu_transform(dest.unaccented, 'Name') AS "destName"
FROM dest
WHERE chr(cp) <> dest.unaccented;
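
For readers without icu_ext installed, the same enumeration can be
approximated in Python with the standard unicodedata module. This is only
a sketch: the function names are invented for the example, the set of
rows depends on the Unicode version bundled with the Python build, and
using category "Mn" for the removal step is an assumption, so the output
may not match the ICU result exactly.

```python
import unicodedata


def unaccent_char(ch: str) -> str:
    """NFD-decompose, drop non-spacing marks (category Mn), recompose."""
    decomposed = unicodedata.normalize("NFD", ch)
    kept = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", kept)


def names(text: str) -> str:
    """Join the Unicode names of each character; empty for empty text."""
    return " + ".join(unicodedata.name(c, "<unnamed>") for c in text)


def changed_in_block(start: int = 0x0600, end: int = 0x06FF):
    """Return (src, srcName, dest, destName) for code points that change."""
    rows = []
    for cp in range(start, end + 1):
        src = chr(cp)
        dest = unaccent_char(src)
        if src != dest:
            rows.append((src, names(src), dest, names(dest)))
    return rows


if __name__ == "__main__":
    for src, src_name, dest, dest_name in changed_in_block():
        print(f"U+{ord(src):04X}\t{src_name}\t{dest_name}")
```

Rows whose destination is empty are the combining marks themselves,
which the transformation removes entirely.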

[1] https://en.wikipedia.org/wiki/Arabic_diacritics
[2] https://github.com/dverite/icu_ext#icu_transform

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

Attachment Content-Type Size
unaccent-arabic-block.utf8.output application/octet-stream 5.1 KB
