Re: BUG #15548: Unaccent does not remove combining diacritical characters

From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, thomas(dot)munro(at)enterprisedb(dot)com, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-01-10 02:52:05
Message-ID: CAAhbUMNZ0ooK6SzLNdkxzdBsQHOJf_rg_EjwoNL8QHTwQuriRw@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Tue, 8 Jan 2019 at 22:53, Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> I have been doing a bit more than a review by studying by myself the
> new format and the old format, and the way we could do things in the
> XML parsing part, and hacked the code by myself. On top of the
> incorrect URL for Latin-ASCII.xml, I have noticed as well that there
> should be only one block transforms/transform/tRule in the source, so
> I think that we should add an assertion on that as a sanity check. I
> have also changed the code to use splitlines(), which is more portable
> across platforms, and added an extra regression test for the new
> characters added to unaccent.rules. This does not close this thread
> but we can support the new format this way. I have also documented
> the way to browse the full set of releases for Latin-ASCII.xml, and
> precisely which version has been used for this patch.
>
> This does not close yet the part for diacritical characters, but
> supporting the new format is a step into this direction. What do
> you think?
>
Hi Michael,
Thank you for putting so much effort into this. I think that looks great.
When I was doing this, I discovered that I could parse both pre- and post-
r29 versions, so I went with that, but I agree that there's probably no
good reason to do so.

And thank you for the information on splitlines(); that's a method I've
overlooked. .split('\n') should be identical if Python is, as usual,
compiled with universal newlines support, but it's nice to have a method
guaranteed to work in all instances.
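
As a small illustration of the difference, here is a sketch using made-up
sample strings (they are not taken from Latin-ASCII.xml or from the patch).
Universal newlines translation applies when a file is read in text mode, so
a string that still contains "\r\n" is where the two methods diverge;
.split('\n') also leaves a trailing empty element after a final newline.

```python
# Hypothetical sample data, for illustration only.
unix_text = "a\nb\n"        # LF line endings
dos_text = "a\r\nb\r\n"     # CRLF line endings, not yet translated

# split('\n') keeps stray '\r' characters and a trailing empty string.
split_unix = unix_text.split("\n")     # ['a', 'b', '']
split_dos = dos_text.split("\n")       # ['a\r', 'b\r', '']

# splitlines() recognises LF, CRLF and CR, and drops the trailing empty item.
lines_unix = unix_text.splitlines()    # ['a', 'b']
lines_dos = dos_text.splitlines()      # ['a', 'b']

print(split_unix, split_dos)
print(lines_unix, lines_dos)
```

So the two agree on the line contents only when the input has already been
normalised to "\n", and even then differ by the trailing empty element.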

Best wishes,
Hugh
