Re: Extra Vietnamese unaccent rules

From: Kha Nguyen <nlhkha(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Extra Vietnamese unaccent rules
Date: 2017-05-26 22:40:34
Message-ID: B7A3AD71-931B-4559-96FF-2E9D2B179651@gmail.com
Lists: pgsql-hackers

Does this mean that the Python script has to be updated to be recursive too?

> On 27 May 2017, at 0.48, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>
> On Sat, May 27, 2017 at 9:09 AM, Kha Nguyen <nlhkha(at)gmail(dot)com> wrote:
>> Could you explain to me what this line means:
>> “
>> 1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4
>> ”
>>
>> If you could give me an example of adding a rule for the “recursive” case, I can do the rest. I am not yet familiar with how the unaccent rules are generated.
>
> So contrib/unaccent/generate_unaccent_rules.py is a Python script that
> takes UnicodeData.txt, a list of information about all Unicode
> codepoints available at a URL that is shown in a comment, and
> generates unaccent.rules. The idea was to avoid having to change it
> manually every time someone finds characters that should be in there
> (as you have just done!) by doing it systematically.
>
> Unicode has two ways to represent characters with accents: either with
> composed codepoints like "é" or decomposed codepoints where you say
> "e" and then "´". The field "00E2 0301" is the decomposed form of
> that character above. Our job here is to identify the basic letter
> that each composed character contains, by analysing the decomposed
> field that you see in that line. I failed to realise that characters
> with TWO accents are described as a composed character with ONE accent
> plus another accent.
>
> You don't have to worry about decoding that line, it's all done in
> that Python script. The problem is just in the function
> is_letter_with_marks(). Instead of just checking if combining_ids[0]
> is a plain letter, it looks like it should also check if
> combining_ids[0] itself is a letter with marks. Also get_plain_letter
> would need to be able to recurse to extract the "a".
>
> I hope that helps!
>
> --
> Thomas Munro
> http://www.enterprisedb.com
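
The recursive check described in the quoted message could look roughly like the sketch below. This is not the actual contrib/unaccent/generate_unaccent_rules.py: the Codepoint record, the parse_lines helper, and the sample lines are assumptions made for illustration (the sample lines follow the UnicodeData.txt field layout, but only fields 0, 2, and 5 are read). Only is_letter_with_marks, get_plain_letter, and combining_ids are names taken from the thread.

```python
# Illustrative sketch only; not the real generate_unaccent_rules.py.
from collections import namedtuple

# Hypothetical record: codepoint number, general category, decomposition ids.
Codepoint = namedtuple("Codepoint", ["id", "general_category", "combining_ids"])

# Sample lines in the UnicodeData.txt layout (semicolon-separated fields).
# Field 0 = codepoint, field 2 = general category, field 5 = decomposition.
SAMPLE_LINES = [
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "00E2;LATIN SMALL LETTER A WITH CIRCUMFLEX;Ll;0;L;0061 0302;;;;N;;;00C2;;00C2",
    "0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;;;;;",
    "0302;COMBINING CIRCUMFLEX ACCENT;Mn;230;NSM;;;;;N;;;;;",
    "1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4",
]


def parse_lines(lines):
    """Build a {codepoint id: Codepoint} table from UnicodeData-style lines.

    Assumes canonical decompositions only (no "<compat>"-style tags),
    which holds for the sample lines above.
    """
    table = {}
    for line in lines:
        fields = line.split(";")
        ids = [int(s, 16) for s in fields[5].split()]
        table[int(fields[0], 16)] = Codepoint(int(fields[0], 16), fields[2], ids)
    return table


def is_plain_letter(cp):
    """True for unaccented ASCII letters a-z and A-Z."""
    return ord("a") <= cp.id <= ord("z") or ord("A") <= cp.id <= ord("Z")


def is_mark(cp):
    """True for combining marks (general categories Mn, Me, Mc)."""
    return cp.general_category in ("Mn", "Me", "Mc")


def is_letter_with_marks(cp, table):
    """True if cp decomposes to a base followed only by marks, where the
    base is a plain letter or is itself a letter with marks (the recursion)."""
    if len(cp.combining_ids) < 2:
        return False
    if not all(is_mark(table[i]) for i in cp.combining_ids[1:]):
        return False
    base = table[cp.combining_ids[0]]
    return is_plain_letter(base) or is_letter_with_marks(base, table)


def get_plain_letter(cp, table):
    """Follow combining_ids[0] repeatedly until a plain letter is reached."""
    if is_plain_letter(cp):
        return cp
    if is_letter_with_marks(cp, table):
        return get_plain_letter(table[cp.combining_ids[0]], table)
    raise ValueError("U+%04X has no plain base letter" % cp.id)


table = parse_lines(SAMPLE_LINES)
base = get_plain_letter(table[0x1EA5], table)
print("U+1EA5 ->", chr(base.id))
```

With these sample lines, U+1EA5 decomposes to U+00E2 plus U+0301, and U+00E2 in turn decomposes to U+0061 plus U+0302, so the recursion bottoms out at "a" after two steps.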
