Re: Extra Vietnamese unaccent rules

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Dang Minh Huong <kakalot49(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Kha Nguyen <nlhkha(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Extra Vietnamese unaccent rules
Date: 2017-05-29 01:47:40
Message-ID: CAEepm=0S_b04AjS-4acrjU+20FgamKwF5CiJz-cd=E4a1SOWMw@mail.gmail.com
Lists: pgsql-hackers

On Sun, May 28, 2017 at 7:55 PM, Dang Minh Huong <kakalot49(at)gmail(dot)com> wrote:
> [Quoting Thomas]
>> You don't have to worry about decoding that line, it's all done in
>> that Python script. The problem is just in the function
>> is_letter_with_marks(). Instead of just checking if combining_ids[0]
>> is a plain letter, it looks like it should also check if
>> combining_ids[0] itself is a letter with marks. Also get_plain_letter
>> would need to be able to recurse to extract the "a".
>
> Thanks for reporting and for the lecture about Unicode.
> I attached a patch following Thomas's instructions. Could you confirm it?

- is_plain_letter(table[codepoint.combining_ids[0]]) and \
+ (is_plain_letter(table[codepoint.combining_ids[0]]) or\
+ len(table[codepoint.combining_ids[0]].combining_ids) > 1) and \

Shouldn't you use "or is_letter_with_marks()" instead of "or
len(...) > 1"?  Your test might catch something that isn't based on a
'letter' (according to is_plain_letter).  Otherwise this looks pretty
good to me.  Please add it to the next commitfest.
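
Roughly, the shape I have in mind is something like the following, a
self-contained sketch with hypothetical stand-ins for the script's
codepoint records and helpers, not the actual code in
generate_unaccent_rules.py:

from collections import namedtuple

# Hypothetical stand-in for the script's codepoint records.
Codepoint = namedtuple('Codepoint', ['id', 'combining_ids'])

def is_plain_letter(codepoint):
    # A bare ASCII letter.
    return ord('a') <= codepoint.id <= ord('z') or \
        ord('A') <= codepoint.id <= ord('Z')

def is_letter_with_marks(codepoint, table):
    # The base of the decomposition must be a plain letter, or itself
    # a letter with marks, rather than just len(combining_ids) > 1.
    if len(codepoint.combining_ids) <= 1:
        return False
    base = table[codepoint.combining_ids[0]]
    return is_plain_letter(base) or is_letter_with_marks(base, table)

def get_plain_letter(codepoint, table):
    # Recurse through the decomposition until the bare letter appears.
    if is_plain_letter(codepoint):
        return codepoint
    return get_plain_letter(table[codepoint.combining_ids[0]], table)

# U+1EAF (a with breve and acute) decomposes to U+0103 (a with breve)
# plus U+0301 (acute); U+0103 in turn decomposes to "a" plus U+0306.
table = {
    0x0061: Codepoint(0x0061, []),
    0x0103: Codepoint(0x0103, [0x0061, 0x0306]),
    0x1EAF: Codepoint(0x1EAF, [0x0103, 0x0301]),
}
assert is_letter_with_marks(table[0x1EAF], table)
assert get_plain_letter(table[0x1EAF], table).id == ord('a')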

I expect that some users in Vietnam will consider this to be a bugfix,
which raises the question of whether to backpatch it. Perhaps we
could consider fixing it for 10. Then users of older versions could
grab the rules file from 10 to use with 9.whatever if they want to do
that and reindex their data as appropriate.

> [Quoting Michael]
>> Actually, with the recent work that has been done with
>> unicode_norm_table.h which has been to transpose UnicodeData.txt into
>> user-friendly tables, shouldn't the python script of unaccent/ be
>> replaced by something that works on this table? This does a canonical
>> decomposition but just keeps the first characters with a combining
>> class of 0. So we have basic APIs able to look at UnicodeData.txt
>> and let the caller make decisions with the returned result.
>
> Thanks, I will learn about it.

It seems like that could be useful for runtime use (I'm sure there is
a whole world of Unicode support we could add), but here we're only
trying to generate a mapping file to add to the source tree, so I'm
not sure how it's relevant.
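
As an aside, Python's standard unicodedata module is enough to see
what canonical decomposition does to one of these characters; the
snippet below is just an illustration of the Unicode data involved,
not part of either patch:

import unicodedata

# U+1EAF ("a" with breve and acute) decomposes canonically into a base
# letter followed by two combining marks; only the base letter has a
# combining class of 0.
for ch in unicodedata.normalize('NFD', '\u1eaf'):
    print('U+%04X ccc=%-3d %s' % (ord(ch), unicodedata.combining(ch),
                                  unicodedata.name(ch)))
# U+0061 ccc=0   LATIN SMALL LETTER A
# U+0306 ccc=230 COMBINING BREVE
# U+0301 ccc=230 COMBINING ACUTE ACCENT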

--
Thomas Munro
http://www.enterprisedb.com
