Re: BUG #15548: Unaccent does not remove combining diacritical characters

From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: thomas(dot)munro(at)enterprisedb(dot)com
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-20 22:39:36
Message-ID: CAAhbUMNyZ+PhNr_mQ=G161K0-hvbq13Tz2is9M3WK+yX9cQOCw@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

Okay, I've tried to separate everything cleanly. The patches are numbered
in the order in which they should be applied. Each patch contains all the
updates appropriate to that version (i.e., if the change would modify
unaccent.rules, those changes are also in the patch):

01 - Updates generate_unaccent_rules.py to be Python 2 and 3 compatible.
The approach I have taken is "native" Python 3 compatibility with
adjustments for Python 2. There's a marked block at the beginning of the
file that can be removed whenever Python 2 support is dropped. I haven't
followed the recommended practice of importing the "past" or "future"
modules, as the changes are minimal, and these are just additional
dependencies that need to be installed separately, which didn't seem to
make sense for a utility script. This patch also updates sql/unaccent.sql
to UTF-8 format.
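
For readers unfamiliar with the approach, here is a minimal sketch of what a
marked, removable Python 2 compatibility block can look like. It is an
illustration only, not the contents of the 01 patch; the names text_type and
codepoint_to_text are hypothetical.

```python
import sys

# BEGIN: Python 2 compatibility (hypothetical shim, not the actual patch).
# Delete this whole block once Python 2 support is dropped.
if sys.version_info[0] == 2:
    # Python 2's chr() only covers 0-255; unichr() returns a unicode string.
    chr = unichr  # noqa: F821 (name only exists on Python 2)
    text_type = unicode  # noqa: F821
else:
    text_type = str
# END: Python 2 compatibility


def codepoint_to_text(codepoint):
    """Return the one-character text string for a Unicode code point."""
    return chr(codepoint)
```

The rest of the script is then written as ordinary Python 3, and only the
marked block needs to go away later.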

02 - Updates generate_unaccent_rules.py to work with all versions (I tested
r28 and r34) of the Latin-ASCII transliteration file. It also updates
unaccent.rules to have the output of the r34 transliteration file. This
patch should work without the 01 patch.

03 - Updates generate_unaccent_rules.py to remove combining diacritical
marks. It also updates unaccent.rules with the revised output, and adds
tests to sql/unaccent.sql. It will neither apply nor work unless the 01 patch
has been applied. It should work without the 02 patch.
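
To make the idea concrete, here is a rough sketch (Python 3) of generating
removal rules for combining diacritical marks. It is not the code in the 03
patch. It assumes that the marks of interest are the nonspacing marks in the
Combining Diacritical Marks block, U+0300 to U+036F, and that a rules line
containing only the source character means "delete this character". The
function names are hypothetical.

```python
import unicodedata

# Assumption: only the Combining Diacritical Marks block is of interest.
COMBINING_START = 0x0300
COMBINING_END = 0x036F


def combining_mark_rules():
    """Return one rule line per combining mark: the bare character,
    with no replacement text, so that the character is removed."""
    rules = []
    for codepoint in range(COMBINING_START, COMBINING_END + 1):
        char = chr(codepoint)
        # "Mn" is the Unicode general category for nonspacing marks.
        if unicodedata.category(char) == "Mn":
            rules.append(char)
    return rules


def strip_combining_marks(text):
    """Apply the rules above to a string, for a quick sanity check."""
    marks = set(combining_mark_rules())
    return "".join(char for char in text if char not in marks)
```

For example, strip_combining_marks("e\u0301") takes an "e" followed by a
combining acute accent and returns a plain "e".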

When you look at unaccent.rules generated by the 03 version, there may
appear to be blank lines. I've checked, and they're not blank. They contain
combining characters, which are only visible when another character precedes
them, at least in my editor.
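
One way to confirm that such a line is not empty is to list its code points
instead of relying on how the editor renders it. This is a small sketch
(Python 3); describe_line is a hypothetical helper, not part of the patches.

```python
import unicodedata


def describe_line(line):
    """Return (code point, Unicode name) pairs for each character in a
    line, ignoring the trailing newline."""
    return [
        ("U+%04X" % ord(char), unicodedata.name(char, "<unnamed>"))
        for char in line.rstrip("\n")
    ]
```

A line that looks blank but holds a combining acute accent yields
[("U+0301", "COMBINING ACUTE ACCENT")], whereas a truly blank line yields
an empty list.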

I'll go update the CommitFest now. I hope I've covered everything; please
let me know if there's anything I've missed.

Best wishes,
Hugh

Attachment Content-Type Size
01-generate-unaccent-rules-python2-and-3-01.patch text/x-patch 4.2 KB
02-generate_unaccent_rules-handle-all-Latin-ASCII-versions-01.patch text/x-patch 1.7 KB
03-generate_unaccent_rules-remove-combining-diacritical-accents-01.patch text/x-patch 3.9 KB
