Re: Extra Vietnamese unaccent rules

From: Dang Minh Huong <kakalot49(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Kha Nguyen <nlhkha(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Extra Vietnamese unaccent rules
Date: 2017-07-05 15:45:17
Message-ID: 7a813796-80c5-aa93-8772-bbddf2f6a10f@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2017/07/05 15:28, Michael Paquier wrote:
> I have finally been able to look at this patch.

Thanks for reviewing and the new version of the patch.
> (Surprised to see that generate_unaccent_rules.py is inconsistent on
> MacOS, runs fine on Linux).
>
> def get_plain_letter(codepoint, table):
> """Return the base codepoint without marks."""
> if is_letter_with_marks(codepoint, table):
> - return table[codepoint.combining_ids[0]]
> + if len(table[codepoint.combining_ids[0]].combining_ids) > 1:
> + # Recursive to find the plain letter
> + return get_plain_letter(table[codepoint.combining_ids[0]],table)
> + elif is_plain_letter(table[codepoint.combining_ids[0]]):
> + return table[codepoint.combining_ids[0]]
> + else:
> + return None
> elif is_plain_letter(codepoint):
> return codepoint
> else:
> - raise "mu"
> + return None
> The code paths returning None should not be reached, so I would
> suggest adding an assertion instead. Callers of get_plain_letter would
> blow up on None, still that would make future debugging harder.
>
> def is_letter_with_marks(codepoint, table):
> - """Returns true for plain letters combined with one or more marks."""
> + """Returns true for letters combined with one or more marks."""
> # See http://www.unicode.org/reports/tr44/tr44-14.html#General_Category_Values
> return len(codepoint.combining_ids) > 1 and \
> - is_plain_letter(table[codepoint.combining_ids[0]]) and \
> + (is_plain_letter(table[codepoint.combining_ids[0]]) or\
> + is_letter_with_marks(table[codepoint.combining_ids[0]],table))
> and \
> all(is_mark(table[i]) for i in codepoint.combining_ids[1:]
> This was already hard to follow, and this patch makes its harder. I
> think that the thing should be refactored with multiple conditions.
>
> if is_letter_with_marks(codepoint, table):
> - charactersSet.add((codepoint.id,
> + if get_plain_letter(codepoint, table) <> None:
> + charactersSet.add((codepoint.id,
> This change is not necessary as a letter with marks is not a plain
> character anyway.
>
> Testing with characters having two accents, the results are produced
> as wanted. I am attaching an updated patch with all those
> simplifications. Thoughts?

Thanks, so pretty. The patch is fine to me.

---
Thanks and best regards,
Dang Minh Huong

---
このEメールはアバスト アンチウイルスによりウイルススキャンされています。
https://www.avast.com/antivirus

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Teodor Sigaev 2017-07-05 16:29:52 Suspicious place in heap_prepare_freeze_tuple()
Previous Message Ryan Murphy 2017-07-05 15:45:15 Re: Error-like LOG when connecting with SSL for password authentication