|From:||Hugh Ranalli <hugh(at)whtc(dot)ca>|
|To:||Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>|
|Cc:||Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-bugs(at)lists(dot)postgresql(dot)org|
|Subject:||Re: BUG #15548: Unaccent does not remove combining diacritical characters|
On Fri, 14 Dec 2018 at 17:50, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Hugh Ranalli <hugh(at)whtc(dot)ca> writes:
> Cool. Please add it to the current CF so we don't forget about it:
> Me too -- seems like that bears looking into. Perhaps the script's
> results are platform dependent -- what were you testing on?
I'm on Linux Mint 17, which is based on Ubuntu 14.04. But I don't think
that's it. The program's decisions come from the two data files, the
Unicode data set and the Latin-ASCII transliteration file. The script uses
the Unicode data to identify letters (and now combining marks) and, if they
are in range,
performs a substitution. It then uses the transliteration file to find
rules for particular character substitutions (for example, that file seems
to handle the copyright symbol substitution). I don't see anything platform
dependent in there.
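To illustrate why I don't expect platform differences, here is a minimal
sketch of the category-plus-range idea. It is not the actual script: it uses
Python's built-in unicodedata module rather than parsing the Unicode data
file, the range values shown are assumed for illustration, and base_letter,
is_plain_letter and is_mark are hypothetical helper names.

```python
import unicodedata

# Assumed for illustration: plain ASCII letters only. The real script
# defines its own PLAIN_LETTER_RANGES.
PLAIN_LETTER_RANGES = ((ord("a"), ord("z")), (ord("A"), ord("Z")))


def is_plain_letter(ch):
    """True if ch falls inside one of the plain-letter ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PLAIN_LETTER_RANGES)


def is_mark(ch):
    """True if ch has a mark category (Mn, Mc or Me)."""
    return unicodedata.category(ch).startswith("M")


def base_letter(ch):
    """Return the plain letter that ch reduces to, or None if no rule applies."""
    if not unicodedata.category(ch).startswith("L"):
        return None  # not a letter, e.g. a symbol such as the copyright sign
    decomposed = unicodedata.normalize("NFD", ch)
    base, marks = decomposed[0], decomposed[1:]
    if base != ch and is_plain_letter(base) and all(is_mark(m) for m in marks):
        return base
    return None


print(base_letter("\u00e9"))  # e with acute accent
print(base_letter("\u00a9"))  # copyright sign
```

Every decision here comes from the character data and the ranges, which is
why a symbol like the copyright sign has to come from the transliteration
file instead.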
Looking more closely, I also see that the script isn't generating ligatures,
even though it should: although the program can generate them, none of the
ligatures are in the ranges defined in PLAIN_LETTER_RANGES, so they are
skipped.
This could probably be handled by adding the ligature ranges to the defined
ranges. Symbol types could be added to the types it looks at, and perhaps
the codepoint ranges collapsed into one list, as the IDs are unique across
all categories. I don't think we'd want to just rely on ranges, as that
could include control characters, punctuation, etc.
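A rough sketch of what I have in mind, with one collapsed list of code point
ranges plus a category check. The names CODEPOINT_RANGES,
ALLOWED_CATEGORY_PREFIXES and is_candidate are hypothetical, and the ranges
are only examples, not a proposal for the final list.

```python
import unicodedata

# Hypothetical single list of code point ranges (example values only).
CODEPOINT_RANGES = (
    (0x0041, 0x005A),  # A-Z
    (0x0061, 0x007A),  # a-z
    (0x00A0, 0x00FF),  # Latin-1 Supplement: letters, symbols and punctuation
    (0xFB00, 0xFB06),  # Latin ligatures in Alphabetic Presentation Forms
)

# Letters ("L*") and symbols ("S*"); punctuation and control characters
# are deliberately left out.
ALLOWED_CATEGORY_PREFIXES = ("L", "S")


def in_ranges(cp):
    """True if the code point lies in any of the configured ranges."""
    return any(lo <= cp <= hi for lo, hi in CODEPOINT_RANGES)


def is_candidate(ch):
    """True if ch is in range and has an allowed general category."""
    category = unicodedata.category(ch)
    return category.startswith(ALLOWED_CATEGORY_PREFIXES) and in_ranges(ord(ch))


for ch in ("\ufb01", "\u00a9", "\u00ab"):
    # fi ligature, copyright sign, left-pointing double angle quotation mark
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_candidate(ch))
```

The quotation mark is inside the Latin-1 range but is rejected by its
category, which is the reason for keeping both checks.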
There are a number of other characters that appear in unaccent.rules that
aren't generated by the script. I've attached a diff of the output of
generate_unaccent_rules (using the version before my changes, to simplify
matters) and unaccent.rules. Unfortunately, I don't know how to interpret
most of these characters.
I suppose it's valid to ask if changing © to (C) is even something an
"unaccent" function should do. Given that it's in the existing rules file,
should it be supported as an existing behaviour?
Sorry for more questions than answers. ;-)