Re: BUG #15548: Unaccent does not remove combining diacritical characters

From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-15 18:08:00
Message-ID: CAAhbUMN1n=ZVns-OeCbaVRYPS0oj7tTnmJrzw7Az-op4DHC+JA@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Fri, 14 Dec 2018 at 17:50, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Hugh Ranalli <hugh(at)whtc(dot)ca> writes:
> Cool. Please add it to the current CF so we don't forget about it:
> https://commitfest.postgresql.org/21/

Done.

> Me too -- seems like that bears looking into. Perhaps the script's
> results are platform dependent -- what were you testing on?
>
I'm on Linux Mint 17, which is based on Ubuntu 14.04. But I don't think
that's it. The program's decisions come from the two data files, the
Unicode data set and the Latin-ASCII transliteration file. The script uses
categories (
ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html#General%20Category)
to identify letters (and now combining marks) and if they are in range,
performs a substitution. It then uses the transliteration file to find
rules for particular character substitutions (for example, that file seems
to handle the copyright symbol substitution). I don't see anything platform
dependent in there.
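
As a rough sketch of the category-based check described above (not the
script itself): this uses Python's built-in unicodedata module rather than
parsing the Unicode data file directly, and the range list and function
names here are hypothetical, chosen only for illustration.

```python
import unicodedata

# Hypothetical range list for illustration; the real script defines its own.
PLAIN_LETTER_RANGES = ((ord("a"), ord("z")), (ord("A"), ord("Z")))


def is_plain_letter(ch):
    """True if ch falls inside one of the plain-letter ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in PLAIN_LETTER_RANGES)


def is_mark(ch):
    """True if ch is in one of the Unicode mark categories (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def base_letter(ch):
    """Return the plain letter under an accented letter, or None."""
    if not unicodedata.category(ch).startswith("L"):
        return None
    decomposed = unicodedata.normalize("NFD", ch)
    base, rest = decomposed[0], decomposed[1:]
    # Only substitute when a plain letter is followed purely by marks.
    if rest and is_plain_letter(base) and all(is_mark(c) for c in rest):
        return base
    return None
```

Nothing in this depends on the platform, only on the Unicode data in use.
The copyright symbol has a symbol category rather than a letter category,
so a check like this one returns None for it; that kind of substitution
has to come from the transliteration rules instead.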

Looking more closely, I also see that the script isn't generating
ligatures, even though it should: the program can generate them, but none
of the ligatures fall within the ranges defined in PLAIN_LETTER_RANGES, so
they are skipped.
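
To illustrate why a range check alone skips ligatures, here is a small
sketch using Python's unicodedata module. The range list is an assumption
(plain ASCII letters only) and the helper names are hypothetical; the real
script may differ.

```python
import unicodedata

# Assumed range list for illustration: ASCII letters only.
PLAIN_LETTER_RANGES = ((ord("a"), ord("z")), (ord("A"), ord("Z")))


def in_ranges(codepoint):
    """True if the codepoint falls inside one of the defined ranges."""
    return any(lo <= codepoint <= hi for lo, hi in PLAIN_LETTER_RANGES)


def compat_expansion(ch):
    """Return the compatibility expansion of ch, or None if it has none."""
    fields = unicodedata.decomposition(ch).split()
    if not fields or fields[0] != "<compat>":
        return None
    # Remaining fields are hexadecimal codepoints of the expansion.
    return "".join(chr(int(f, 16)) for f in fields[1:])


fi_ligature = "\ufb01"  # LATIN SMALL LIGATURE FI
```

Here compat_expansion(fi_ligature) gives "fi", so the expansion is
available, but in_ranges(ord(fi_ligature)) is False, so a filter based on
these ranges would drop the character before the expansion is ever used.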

This could probably be handled by adding the ligature ranges to the defined
ranges. Symbol types could be added to the types it looks at, and perhaps
the codepoint ranges collapsed into one list, as the IDs are unique across
all categories. I don't think we'd want to just rely on ranges, as that
could include control characters, punctuation, etc.
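
One possible shape for that combined check, as a sketch only: a single
list of codepoint ranges plus a category filter. The ranges and category
prefixes below are hypothetical examples, not a proposal for the actual
values.

```python
import unicodedata

# Hypothetical single list of codepoint ranges (Latin-1 supplement,
# combining diacritical marks, Latin ligatures).
CANDIDATE_RANGES = ((0x00A0, 0x00FF), (0x0300, 0x036F), (0xFB00, 0xFB06))

# Hypothetical category prefixes: letters, marks and symbols.
CANDIDATE_CATEGORY_PREFIXES = ("L", "M", "S")


def is_candidate(ch):
    """True if ch is in a listed range and has an accepted category."""
    in_range = any(lo <= ord(ch) <= hi for lo, hi in CANDIDATE_RANGES)
    category = unicodedata.category(ch)
    return in_range and category.startswith(CANDIDATE_CATEGORY_PREFIXES)
```

With these values, the copyright sign (U+00A9, a symbol) and U+00E9 (a
letter) are accepted, while the inverted question mark (U+00BF,
punctuation) and the soft hyphen (U+00AD, a format character) are rejected
even though they sit in the same range.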

There are a number of other characters that appear in unaccent.rules that
aren't generated by the script. I've attached a diff of the output of
generate_unaccent_rules (using the version before my changes, to simplify
matters) and unaccent.rules. Unfortunately, I don't know how to interpret
most of these characters.

I suppose it's valid to ask if changing © to (C) is even something an
"unaccent" function should do. Given that it's in the existing rules file,
should it be supported as an existing behaviour?

Sorry for more questions than answers. ;-)

Attachment Content-Type Size
unaccent.diff text/x-patch 5.6 KB
