Re: Extra Vietnamese unaccent rules

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Man Trieu <man(dot)trieu(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Dang Minh Huong <kakalot49(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Kha Nguyen <nlhkha(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Extra Vietnamese unaccent rules
Date: 2017-07-05 06:28:52
Message-ID: CAB7nPqR5GQgJtfckvrX8+kcYd-EV3Y_+Kq_VJRBP2dFGBwDGKQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jun 7, 2017 at 1:06 AM, Man Trieu <man(dot)trieu(at)gmail(dot)com> wrote:
> 2017-06-07 0:31 GMT+09:00 Bruce Momjian <bruce(at)momjian(dot)us>:
>>
>> On Wed, Jun 7, 2017 at 12:10:25AM +0900, Dang Minh Huong wrote:
>> > > On Jun 4, 29 Heisei, at 00:48, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>> > >>>> Shouldn't you use "or is_letter_with_marks()", instead of "or
>> > >>>> len(...)
>> > >>>>> 1"? Your test might catch something that isn't based on a
>> > >>>>> 'letter'
>> > >>>> (according to is_plain_letter). Otherwise this looks pretty good
>> > >>>> to
>> > >>>> me. Please add it to the next commitfest.
>> > >>>
>> > >>> Thanks for confirm, sir.
>> > >>> I will add it to the next CF soon.
>> > >>
>> > >> Sorry for lately response. I attach the update patch.
>> > >
>> > > Uh, there is no patch attached.
>> > >
>> >
>> > Sorry sir, reattach the patch.
>> > I also added it to the next CF and set reviewers to Thomas Munro. Could
>> > you confirm for me.
>>
>> There seems to be a problem. I can't see a patch dated 2017-06-07 on
>> the commitfest page:
>>
>> https://commitfest.postgresql.org/14/1161/
>>
>> I added the thread but there was no change. (I think the thread was
>> already present.) It appears it is not seeing this patch as the latest
>> patch.
>>
>> Does anyone know why this is happening?
>
> May be due to my Mac's mailer? Sorry but I try one more time to attach the
> patch by webmail.

I have finally been able to look at this patch.
(Surprised to see that generate_unaccent_rules.py is inconsistent on
MacOS, runs fine on Linux).

def get_plain_letter(codepoint, table):
"""Return the base codepoint without marks."""
if is_letter_with_marks(codepoint, table):
- return table[codepoint.combining_ids[0]]
+ if len(table[codepoint.combining_ids[0]].combining_ids) > 1:
+ # Recursive to find the plain letter
+ return get_plain_letter(table[codepoint.combining_ids[0]],table)
+ elif is_plain_letter(table[codepoint.combining_ids[0]]):
+ return table[codepoint.combining_ids[0]]
+ else:
+ return None
elif is_plain_letter(codepoint):
return codepoint
else:
- raise "mu"
+ return None
The code paths returning None should not be reached, so I would
suggest adding an assertion instead. Callers of get_plain_letter would
blow up on None, still that would make future debugging harder.

def is_letter_with_marks(codepoint, table):
- """Returns true for plain letters combined with one or more marks."""
+ """Returns true for letters combined with one or more marks."""
# See http://www.unicode.org/reports/tr44/tr44-14.html#General_Category_Values
return len(codepoint.combining_ids) > 1 and \
- is_plain_letter(table[codepoint.combining_ids[0]]) and \
+ (is_plain_letter(table[codepoint.combining_ids[0]]) or\
+ is_letter_with_marks(table[codepoint.combining_ids[0]],table))
and \
all(is_mark(table[i]) for i in codepoint.combining_ids[1:]
This was already hard to follow, and this patch makes its harder. I
think that the thing should be refactored with multiple conditions.

if is_letter_with_marks(codepoint, table):
- charactersSet.add((codepoint.id,
+ if get_plain_letter(codepoint, table) <> None:
+ charactersSet.add((codepoint.id,
This change is not necessary as a letter with marks is not a plain
character anyway.

Testing with characters having two accents, the results are produced
as wanted. I am attaching an updated patch with all those
simplifications. Thoughts?
--
Michael

Attachment Content-Type Size
unaccent-v2.patch text/x-patch 4.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2017-07-05 06:32:36 Re: Small patch for pg_basebackup argument parsing
Previous Message Ryan Murphy 2017-07-05 06:01:24 Re: CommitFest 2017-09 - How do I know what commit to apply patches to