Re: BUG #13440: unaccent does not remove all diacritics

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Léonard Benedetti <benedetti(at)mlpo(dot)fr>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Gradek <mike(at)busbud(dot)com>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #13440: unaccent does not remove all diacritics
Date: 2016-01-25 23:44:50
Message-ID: CAEepm=3Th+3XRiOoXewLvL1DybCbKxjc0FE4o6XqaZZBLUSOvg@mail.gmail.com
Lists: pgsql-bugs

On Sun, Jan 24, 2016 at 4:18 PM, Léonard Benedetti <benedetti(at)mlpo(dot)fr> wrote:
> I use "unaccent" and I am very pleased with the applied patches for the
> default rules and the Python script to generate them.
>
> But as you pointed out, the "extra cases" (the subset of characters
> which is not generated by the script, but hardcoded) are pretty
> disturbing. The main problem to me is that a number of "extra cases"
> are missing. In fact, the script handles an arbitrary few ligatures
> but leaves many things aside. So I looked for a way to improve the
> generation, to avoid having this trouble.
>
> As you said, some characters don't have a Unicode decomposition. So, to
> handle all these cases, we can use the standard Unicode transliterator
> Latin-ASCII (available in CLDR), which maps Unicode characters to
> ASCII-range equivalents. This approach seems much more elegant: it
> avoids hardcoded cases, and the transliterations are semantically
> correct (at least, as much as they can be).

Wow. It would indeed be nice to use this dataset rather than
maintaining the special cases for œ et al. It would also be nice to pick
up all those other things like ©, ½, …, ≪, ≫ (though these stray a
little bit further from the functionality implied by unaccent's name).
I don't think this alone will completely get rid of the hardcoded
special cases though, because we have these two mappings which look
like Latin but are in fact Cyrillic and I assume we need to keep them:

Ё Е
ё е

Should we extend the composition data analysis to make these remaining
special cases go away? We'd need a definition of is_plain_letter that
returns True for 0415 so that 0401 can be recognised as 0415 + 0308.
Depending on how you do that, you could sweep in some more Cyrillic
mappings and a ton of stuff from other scripts that have precomposed
diacritic codepoints (Greek, Hebrew, Arabic, ...?), and we'd need
someone with knowledge of relevant languages to sign off on the result
-- so it might make sense to stick to a definition that includes just
Latin and Cyrillic for now.
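
To make the idea concrete, here is a minimal sketch of what such a
definition might look like. It is not the code from the existing script:
the range table, the helper names other than is_plain_letter, and the
use of Python's unicodedata module instead of parsing UnicodeData.txt
are all assumptions made for illustration. It only looks at one level of
canonical decomposition.

```python
import unicodedata

# Hypothetical definition of "plain": ASCII letters plus the basic
# Cyrillic alphabet. The exact ranges are an assumption for illustration.
PLAIN_RANGES = (
    (0x0041, 0x005A),  # Latin A-Z
    (0x0061, 0x007A),  # Latin a-z
    (0x0410, 0x044F),  # Cyrillic U+0410..U+044F (includes 0415 and 0435)
)


def is_plain_letter(codepoint):
    """True if the codepoint falls in one of the 'plain' ranges above."""
    return any(lo <= codepoint <= hi for lo, hi in PLAIN_RANGES)


def is_mark(codepoint):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(chr(codepoint)).startswith("M")


def base_letter(char):
    """Return the plain base letter if char canonically decomposes to
    a plain letter followed only by combining marks, else None."""
    decomposition = unicodedata.decomposition(char)
    # Skip characters with no decomposition and compatibility
    # decompositions, which are tagged like "<compat> ...".
    if not decomposition or decomposition.startswith("<"):
        return None
    codepoints = [int(part, 16) for part in decomposition.split()]
    base, marks = codepoints[0], codepoints[1:]
    if is_plain_letter(base) and marks and all(is_mark(cp) for cp in marks):
        return chr(base)
    return None
```

With this definition, base_letter("\u0401") returns "\u0415", because
0401 decomposes to 0415 + 0308. The same ranges would also sweep in
other Cyrillic mappings, for example 0439 (0438 + 0306), which is
exactly the kind of result that would need sign-off.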

(Otherwise it might be tempting to use *only* the transliterator
approach, but CLDR doesn't seem to have appropriate transliterator
files for other scripts. They have for example Cyrillic -> Latin, but
we'd want Cyrillic -> some-subset-of-Cyrillic, analogous to Latin ->
ASCII.)

> So, I modified the script: command-line arguments are now used to
> pass the file path of the transliterator (available as an XML file in
> the Unicode Common Locale Data Repository). You will find attached the
> new script and the generated output for convenience; I will also
> propose a patch for the Commitfest. Note that the script now takes (at
> most) two input files: UnicodeData.txt and (optionally) the XML file of
> the transliterator.
>
> By the way, I took the opportunity to make the script more user-friendly
> by several surface changes. There is now a very light support for
> command line arguments with help messages. The text file was, before,
> passed to the script on standard input; this approach is not appropriate
> when two files must be used. So as I mentioned, the arguments of the
> command line are now used to pass the paths.
>
> Finally, using this transliterator inevitably increases the number
> of characters handled (1044 characters are now handled). I do not
> think that is a problem, on the contrary: after several tests on index
> generation, I saw no significant performance difference. Nonetheless,
> using the transliterator remains optional and a command line option is
> available to disable it (so one can easily generate a small rules file,
> if desired). It seemed however logical to me to keep it on by default:
> that is, a priori, the desired behavior.

+1

--
Thomas Munro
http://www.enterprisedb.com
