Re: Extra Vietnamese unaccent rules

From: Kha Nguyen <nlhkha(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Extra Vietnamese unaccent rules
Date: 2017-05-26 21:09:37
Message-ID: 262536FD-F41D-4776-9056-9FBA60DA61EA@gmail.com
Lists: pgsql-hackers

Could you explain to me what this line means:

1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4

If you could give me an example of adding a rule for the “recursive” case, I can do the rest. I am not yet familiar with how the unaccent rules file is generated.

Thanks
Kha

> On 26 May 2017, at 21.19, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>
> On Sat, May 27, 2017 at 5:13 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I wrote:
>>> Nguyen Le Hoang Kha <nlhkha(at)gmail(dot)com> writes:
>>>> Most of the time in Vietnamese language, there are up to 2 accents in a
>>>> character. These unaccent rules are added to handle such cases (which are
>>>> very common).
>>
>>> I can't see any reason not to add these --- any objections out there?
>>
>> Oh, wait a minute. Patching unaccent.rules directly isn't the way
>> to do this; that file is supposed to be generated by
>> generate_unaccent_rules.py. Can you see how to modify that script
>> to produce these rules?
>
> Looking at one example from this patch:
>
> UTF8: <E1><BA><A5>
> Codepoint: 1EA5
> Name: LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
>
> In UnicodeData.txt it's this line:
>
> 1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4
>
> The problem is that generate_unaccent_rules.py assumes that the
> composing data is a plain letter followed by some number of
> diacritical modifiers. That's true for the characters with a single
> accent, but in this multi-accent case it's the *composed* character 00E2
> (LATIN SMALL LETTER A WITH CIRCUMFLEX) and a diacritical mark 0301
> (COMBINING ACUTE ACCENT). So we need to teach it to be recursive.
>
> --
> Thomas Munro
> http://www.enterprisedb.com
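
For reference, here is a minimal sketch of what "recursive" could mean in practice. It is not the actual generate_unaccent_rules.py code, and all function names below are made up for illustration. It assumes the standard semicolon-separated UnicodeData.txt layout, where field 0 is the codepoint, field 2 is the general category, and field 5 is the decomposition. The five sample lines are a hand-copied excerpt and should be checked against the real file. Entries carrying a compatibility tag such as <compat> are simply treated as non-decomposable here, which is a simplification.

```python
# Hand-copied excerpt of UnicodeData.txt (assumption: verify against the real file).
SAMPLE = """\
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
00E2;LATIN SMALL LETTER A WITH CIRCUMFLEX;Ll;0;L;0061 0302;;;;N;LATIN SMALL LETTER A CIRCUMFLEX;;00C2;;00C2
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
0302;COMBINING CIRCUMFLEX ACCENT;Mn;230;NSM;;;;;N;NON-SPACING CIRCUMFLEX;;;;
1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4
"""


def parse_unicode_data(text):
    """Map codepoint -> (general category, list of decomposition codepoints)."""
    table = {}
    for line in text.splitlines():
        fields = line.split(";")
        code = int(fields[0], 16)
        category = fields[2]
        raw = fields[5]
        # Simplification: ignore tagged (compatibility) decompositions.
        if raw.startswith("<"):
            decomposition = []
        else:
            decomposition = [int(c, 16) for c in raw.split()]
        table[code] = (category, decomposition)
    return table


def is_mark(table, code):
    """True if the codepoint is a combining mark (category Mn, Mc or Me)."""
    entry = table.get(code)
    return entry is not None and entry[0].startswith("M")


def base_letter(table, code):
    """Strip combining marks, recursing while the first element is itself composed."""
    entry = table.get(code)
    if entry is None or not entry[1]:
        return code  # unknown or already a plain character
    first, rest = entry[1][0], entry[1][1:]
    if rest and all(is_mark(table, c) for c in rest):
        # 1EA5 -> 00E2 + 0301, then 00E2 -> 0061 + 0302, then 0061 is plain.
        return base_letter(table, first)
    return code


if __name__ == "__main__":
    table = parse_unicode_data(SAMPLE)
    for code in (0x00E2, 0x1EA5):
        print(chr(code), chr(base_letter(table, code)))
```

The single-accent case needs only one step (00E2 gives 0061), while 1EA5 needs two, which is why a non-recursive check for "plain letter followed by marks" misses it.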
