Re: Extra Vietnamese unaccent rules

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Kha Nguyen <nlhkha(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Extra Vietnamese unaccent rules
Date: 2017-05-26 21:48:34
Message-ID: CAEepm=2o7gmoZaG+t0NJ=xUkLUBqMzg_s1aojm2CX7fkrnXoHA@mail.gmail.com
Lists: pgsql-hackers

On Sat, May 27, 2017 at 9:09 AM, Kha Nguyen <nlhkha(at)gmail(dot)com> wrote:
> Could you explain to me what this line means:
> "1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4"
>
> If you could give me an example of adding a rule for “recursive” case, I can do the rest. I am not familiar with this unaccent format generation yet.

So contrib/unaccent/generate_unaccent_rules.py is a Python script that
takes UnicodeData.txt (a list of information about all Unicode
codepoints, available at a URL that is shown in a comment in the
script) and generates unaccent.rules. The idea was to avoid having to
change the rules manually every time someone finds characters that
should be in there (as you have just done!) by doing it systematically.

Unicode has two ways to represent characters with accents: either as a
single composed codepoint like "é", or as a decomposed sequence where
you say "e" and then the combining acute accent (U+0301). The field
"00E2 0301" in the line above is the decomposed form of that
character: "â" (U+00E2) followed by a combining acute accent. Our job
here is to identify the basic letter that each composed character
contains, by analysing the decomposition field. I failed to realise
that characters with TWO accents are described as a composed character
with ONE accent plus another accent, rather than as a plain letter
plus two accents.
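
To make the field layout concrete, here is a minimal sketch that splits
the quoted line by hand. It is only an illustration, not code from
generate_unaccent_rules.py; the variable names are my own, and it
assumes the line has been rejoined into a single string.

```python
# The UnicodeData.txt line quoted above, rejoined into one string.
LINE = ("1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;"
        "00E2 0301;;;;N;;;1EA4;;1EA4")

# Fields are separated by semicolons.
fields = LINE.split(";")

codepoint = int(fields[0], 16)   # the composed character itself
name = fields[1]                 # its Unicode name
general_category = fields[2]     # "Ll" means lowercase letter

# Field 5 is the decomposition: space-separated hex codepoints.
# (Some other lines carry a "<tag>" prefix here; this one does not.)
decomposition = [int(part, 16) for part in fields[5].split()]

print(f"{codepoint:04X}", name, general_category)
print([f"{c:04X}" for c in decomposition])  # ['00E2', '0301']
```

Note that the first element, 00E2, is itself a letter with an accent,
which is exactly the case the script did not handle.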

You don't have to worry about decoding that line; it's all done in
that Python script. The problem is just in the function
is_letter_with_marks(). Instead of just checking if combining_ids[0]
is a plain letter, it looks like it should also check if
combining_ids[0] itself is a letter with marks. get_plain_letter()
would also need to be able to recurse to extract the "a".
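
Here is a rough sketch of the recursive idea. It is not the script's
actual code: to keep it self-contained it reads decompositions from
Python's standard unicodedata module instead of the script's own
codepoint objects, and the helpers decomposition_ids(),
is_plain_letter() and is_mark() are hypothetical stand-ins written for
this example.

```python
import string
import unicodedata

def decomposition_ids(ch):
    """Canonical decomposition of ch as a list of characters ([] if none)."""
    decomp = unicodedata.decomposition(ch)
    if not decomp or decomp.startswith("<"):
        return []  # no decomposition, or a tagged one that we ignore here
    return [chr(int(part, 16)) for part in decomp.split()]

def is_plain_letter(ch):
    """True for unaccented ASCII letters."""
    return ch in string.ascii_letters

def is_mark(ch):
    """True for combining marks (general categories Mn, Me, Mc)."""
    return unicodedata.category(ch) in ("Mn", "Me", "Mc")

def is_letter_with_marks(ch):
    """True if ch is a plain letter plus one or more marks, possibly nested."""
    ids = decomposition_ids(ch)
    if len(ids) < 2 or not all(is_mark(c) for c in ids[1:]):
        return False
    # The recursive step: the base may itself be a letter with marks.
    return is_plain_letter(ids[0]) or is_letter_with_marks(ids[0])

def get_plain_letter(ch):
    """Base letter of ch; assumes is_letter_with_marks(ch) is true."""
    base = decomposition_ids(ch)[0]
    return base if is_plain_letter(base) else get_plain_letter(base)

print(get_plain_letter("\u1ea5"))  # a
```

The only change from a one-level check is the "or
is_letter_with_marks(ids[0])" clause and the matching recursion in
get_plain_letter().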

I hope that helps!

--
Thomas Munro
http://www.enterprisedb.com
