Re: BUG #13440: unaccent does not remove all diacritics

From: Léonard Benedetti <benedetti(at)mlpo(dot)fr>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #13440: unaccent does not remove all diacritics
Date: 2016-01-24 03:47:40
Message-ID: 56A4495C.8020705@mlpo.fr
Lists: pgsql-bugs

On 24/01/2016 04:18, Léonard Benedetti wrote:
> On 19/06/2015 04:00, Thomas Munro wrote:
>> On Fri, Jun 19, 2015 at 7:30 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> I took a quick look at this list and it seems fairly sane as far as
>>> the automatically-generated items go, except that I see it hits a few
>>> LIGATURE cases (including the existing ij cases, but also fi fl and
>>> ffl). I'm still quite dubious that that is appropriate; at least, if
>>> we do it I think we should be expanding out to the equivalent
>>> multi-letter form, not simply taking one of the letters and dropping
>>> the rest. Anybody else have an opinion on how to handle ligatures?
>> Here is a version that optionally expands ligatures if asked to with
>> --expand-ligatures. It uses the Unicode 'general category' data to
>> identify and strip diacritical marks and distinguish them from
>> ligatures which are expanded to all their parts. It meant I had to
>> load a bunch of stuff into memory up front, but this approach can
>> handle an awkward bunch of ligatures whose component characters have
>> marks: DŽ, Dž, dž -> DZ, Dz, dz. (These are considered to be single
>> characters to maintain a one-to-one mapping with certain Cyrillic
>> characters in some Balkan countries which use or used both scripts.)
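
To make the approach quoted above concrete, here is a minimal sketch of mark stripping plus ligature expansion. It is not the attached script: it uses Python's built-in unicodedata module instead of loading UnicodeData.txt, and the function names are made up for illustration. It also expands every decomposable character, not only ligatures, which is an assumption of the sketch.

```python
import unicodedata


def is_mark(ch):
    # General categories Mn, Mc and Me are the combining marks.
    return unicodedata.category(ch).startswith("M")


def expand(ch):
    """Return ch with combining marks dropped and compound characters
    expanded to all of their parts (hypothetical helper)."""
    decomposition = unicodedata.decomposition(ch)
    if not decomposition:
        # No decomposition data: leave the character unchanged.
        return ch
    fields = decomposition.split()
    if fields[0].startswith("<"):
        # Drop a compatibility tag such as <compat>.
        fields = fields[1:]
    pieces = []
    for field in fields:
        part = chr(int(field, 16))
        if not is_mark(part):
            # Parts may themselves carry marks, so recurse.
            pieces.append(expand(part))
    return "".join(pieces)
```

With this sketch, expand("\u01c5") (the single character Dž) gives "Dz" and expand("\ufb04") (the ffl ligature) gives "ffl", while expand("\u0153") (œ) comes back unchanged because it has no decomposition data.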
>>
>> As for whether we *should* expand ligatures, I'm pretty sure that's
>> what I'd always want, but my only direct experience of languages with
>> ligatures as part of the orthography (rather than ligatures as a
>> typesetting artefact like ffl et al) is French, where œ is used in the
>> official spelling of a bunch of words like œil, sœur, cœur, œuvre when
>> they appear in books, but substituting oe is acceptable on computers
>> because neither the standard French keyboard nor the historically
>> important Latin1 character set includes the character. I'm fairly
>> sure the Dutch have a similar situation with the ligature Ĳ: it's
>> completely interchangeable with the two-letter sequence IJ.
>>
>> So +1 from me for ligature expansion. It might be tempting to think
>> that a function called 'unaccent' should only remove diacritical
>> marks, but if we are going to be pedantic about it, not all
>> diacritical marks are actually accents anyway...
>>
>>> The manually added special cases don't look any saner than they did
>>> before :-(. Anybody have an objection to removing those (except maybe
>>> dotless i) in HEAD?
>> +1 from me for getting rid of the bogus œ->e, IJ -> I, ... transformations, but:
>>
>> 1. For some reason œ, æ (and uppercase equivalents) don't have
>> combining character data in the Unicode file, so they still need to be
>> treated as special cases if we're going to include ligatures. Their
>> expansion should of course be oe and ae rather than what we have.
>> 2. Likewise ß still needs special treatment (it may be historically
>> composed of sz but Unicode doesn't know that, it's its own character
>> now and expands to ss anyway).
>> 3. I don't see any reason to drop the Afrikaans ʼn, though it should
>> surely be expanded to 'n rather than n.
>> 4. I have no clue about whether the single Cyrillic item in there
>> belongs there.
>>
>> Just by the way, there are conventional rules for diacritic removal in
>> some languages, like ä, ö, ü -> ae, oe, ue in German, å -> aa in
>> Scandinavian languages and è -> e' in Italian. A German friend of
>> mine has a ü in his last name and he finishes up with any of three
>> possible spellings of his name on various official documents, credit
>> cards etc as a result! But these sorts of things are specific to
>> individual languages and don't belong in a general accent removal rule
>> file (it would be inappropriate to convert French aigüe to aiguee or
>> Spanish pingüino to pingueino). I guess speakers of those languages
>> could consider submitting rules files for language-specific
>> conventions like that.
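
A small illustration of the point quoted above, as a sketch only. The two rule tables and the helper are hypothetical; they are not taken from any shipped rules file.

```python
# Hypothetical German-convention rules: umlaut vowels become two letters.
GERMAN_RULES = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue"}

# Hypothetical general rules: simply drop the diacritic.
GENERAL_RULES = {"\u00e4": "a", "\u00f6": "o", "\u00fc": "u"}


def apply_rules(rules, text):
    """Replace each character that has a rule; keep the others as-is."""
    return "".join(rules.get(ch, ch) for ch in text)


# The same input under the two conventions (u-diaeresis written as an escape).
print(apply_rules(GERMAN_RULES, "ping\u00fcino"))
print(apply_rules(GENERAL_RULES, "ping\u00fcino"))
```

The German table turns the Spanish word into pingueino, which is exactly the unwanted result mentioned in the quoted text, so conventions like this fit better in separate per-language rules files.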
>>
> I use "unaccent" and I am very pleased with the applied patches for the
> default rules and the Python script to generate them.
>
> But as you pointed out, the "extra cases" (the subset of characters
> which is not generated by the script, but hardcoded) are pretty
> disturbing. The main problem, to me, is that the list is incomplete:
> the script handles a few arbitrarily chosen ligatures but leaves many
> other characters aside. So I looked for a way to improve the
> generation and avoid this problem.
>
> As you said, some characters don't have a Unicode decomposition. To
> handle all these cases, we can use the standard Unicode transliterator
> Latin-ASCII (available in CLDR), which maps Unicode characters to
> ASCII-range equivalents. This approach seems much more elegant: it
> avoids hardcoded cases, and the transliterations are semantically
> correct (at least, as much as they can be).
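
As a rough sketch of how such transliteration rules can be turned into a character mapping: the code below assumes that the simple rules have the shape "source → target ;", with the source given either literally or as a \uXXXX escape. The sample rule lines are written by hand for illustration; they are not copied from the CLDR file, and the names are hypothetical.

```python
import re

# Assumed rule shape: one source character (literal or \uXXXX escape),
# an arrow, then a target that may be wrapped in single quotes.
RULE = re.compile(r"^(?:\\u([0-9A-Fa-f]{4})|(.)) → (?:'(.+)'|(.+?)) ;")


def parse_rules(text):
    """Build a {character: replacement} dict from simple rule lines."""
    mapping = {}
    for line in text.splitlines():
        match = RULE.match(line.strip())
        if match is None:
            # Comments and more complex rules are skipped in this sketch.
            continue
        escaped, literal, quoted, bare = match.groups()
        source = chr(int(escaped, 16)) if escaped else literal
        mapping[source] = quoted if quoted is not None else bare
    return mapping


# Illustrative input only (hand-written, not the real file contents).
SAMPLE = "\n".join([
    "# a comment line",
    "œ → oe ;",
    "\\u00E6 → ae ;",
    "ß → ss ;",
])

RULES = parse_rules(SAMPLE)
```

Characters that have no Unicode decomposition, such as œ, æ and ß, get their expansion from a table like this instead of from a hardcoded list.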
>
> So, I modified the script: command line arguments are now used to pass
> the file path of the transliterator (available as an XML file in the
> Unicode Common Locale Data Repository). You will find attached the new
> script and, for convenience, the generated output. I will also propose
> a patch for the Commitfest. Note that the script now takes (at most)
> two input files: UnicodeData.txt and (optionally) the XML file of the
> transliterator.
>
> By the way, I took the opportunity to make the script more user-friendly
> through several superficial changes. There is now very light support for
> command line arguments, with help messages. Previously, the text file
> was passed to the script on standard input; this approach is not
> appropriate when two files must be used. So, as I mentioned, command
> line arguments are now used to pass the paths.
>
> Finally, using this transliterator inevitably increases the number of
> characters handled (1044 characters are handled). I do not think that
> is a problem, on the contrary: after several tests on index
> generation, I measured no significant performance difference.
> Nonetheless, using the transliterator remains optional, and a command
> line option is available to disable it (so one can easily generate a
> small rules file, if desired). It seemed logical to me, however, to
> keep it on by default: that is, a priori, the desired behavior.
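
For readers who do not open the attachment, the command line interface described in the preceding paragraphs could look roughly like the sketch below. The option names here are invented for illustration and may well differ from the ones in the patch; only the overall shape (two input paths, one switch to disable the transliterator, help messages) comes from the description.

```python
import argparse


def build_parser():
    """Hypothetical command line for a rules-generation script."""
    parser = argparse.ArgumentParser(
        description="Generate unaccent rules from Unicode data files.")
    # Path to UnicodeData.txt (option name is an assumption).
    parser.add_argument("--unicode-data", required=True, metavar="PATH",
                        help="path to UnicodeData.txt")
    # Optional path to the transliterator XML file (name is an assumption).
    parser.add_argument("--translit-xml", metavar="PATH", default=None,
                        help="path to the Latin-ASCII transliterator XML")
    # Switch to turn the transliterator off (name is an assumption).
    parser.add_argument("--no-translit", action="store_true",
                        help="generate rules from UnicodeData.txt only")
    return parser


def use_transliterator(args):
    # On by default; it needs the XML path and can be switched off.
    return args.translit_xml is not None and not args.no_translit
```

Passing paths as arguments rather than on standard input is what allows two input files, and argparse produces the help messages without extra code.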
>
> Léonard Benedetti
Here is the patch, attached.

Léonard Benedetti

Attachment Content-Type Size
improve-unaccent-default-rules-generation-script.patch text/x-patch 15.0 KB
