Re: BUG #13440: unaccent does not remove all diacritics

From: Léonard Benedetti <benedetti(at)mlpo(dot)fr>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Gradek <mike(at)busbud(dot)com>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #13440: unaccent does not remove all diacritics
Date: 2016-02-11 21:05:41
Message-ID: 56BCF7A5.6020204@mlpo.fr
Lists: pgsql-bugs

TL;DR: Special cases that the new script did not handle have been added.
All characters handled by unaccent are now covered by the script, as well
as new ones.

On 26/01/2016 00:44, Thomas Munro wrote:
> Wow. It would indeed be nice to use this dataset rather than
> maintaining the special cases for œ et al. It would also be nice to pick
> up all those other things like ©, ½, …, ≪, ≫ (though these stray a
> little bit further from the functionality implied by unaccent's name).
It is true that the file grows in size and covers more and more
characters. But as Alvaro Herrera said in a previous mail:

“To me, conceptually what unaccent does is turn whatever junk you have
into a very basic common alphabet (ascii); then it's very easy to do
full text searches without having to worry about what accents the people
did or did not use in their searches.”

and I think it makes sense. And since there is no significant
performance difference, I think we can continue in this direction.
> I don't think this alone will completely get rid of the hardcoded
> special cases though, because we have these two mappings which look
> like Latin but are in fact Cyrillic and I assume we need to keep them:
>
> Ё Е
> ё е
Regarding the Cyrillic characters mentioned, I had not noticed them. But
yes, we have to keep them (see Teodor Sigaev's message below).
Furthermore, I continued my research to see which characters were not
yet handled; there are potentially many of them, and it is not always
clear whether they should be. In particular, I found several characters
in the “Letterlike Symbols” Unicode block (U+2100 to U+214F) that were
absent from the transliterator (℃, ℉, etc.). So I changed the script to
handle special cases, and I added those I just mentioned (you will find
attached the new version of the script and the generated output for
convenience). A minimal sketch of the kind of scan involved follows.
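
To illustrate, here is a minimal sketch (assuming only the standard
unicodedata module; this is not the attached script itself, which may
proceed differently):

import unicodedata

# Sketch: scan the "Letterlike Symbols" block (U+2100..U+214F) for code
# points with a <compat> decomposition, e.g. ℃ -> °C and ℉ -> °F.
for cp in range(0x2100, 0x2150):
    decomp = unicodedata.decomposition(chr(cp))
    if decomp.startswith("<compat>"):
        target = "".join(chr(int(h, 16)) for h in decomp.split()[1:])
        print("U+%04X %s -> %s" % (cp, chr(cp), target))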
>
> Should we extend the composition data analysis to make these remaining
> special cases go away? We'd need a definition of is_plain_letter that
> returns True for 0415 so that 0401 can be recognised as 0415 + 0308.
> Depending on how you do that, you could sweep in some more Cyrillic
> mappings and a ton of stuff from other scripts that have precomposed
> diacritic codepoints (Greek, Hebrew, Arabic, ...?), and we'd need
> someone with knowledge of relevant languages to sign off on the result
> -- so it might make sense to stick to a definition that includes just
> Latin and Cyrillic for now.
>
> (Otherwise it might be tempting to use *only* the transliterator
> approach, but CLDR doesn't seem to have appropriate transliterator
> files for other scripts. They have for example Cyrillic -> Latin, but
> we'd want Cyrillic -> some-subset-of-Cyrillic, analogous to Latin ->
> ASCII.)
>
Indeed, I added some special cases, but I doubt very much that they are
exhaustive. It would be good to find a way to avoid these cases.
Regarding the various solutions proposed, it may be possible to opt for
a hybrid one: for example, extend the composition analysis to the blocks
where it is relevant (some of the characters mentioned above show that
they are missing from the transliterators), and use a transliterator
where that is more convenient (perhaps for Cyrillic, etc.). A sketch of
such a composition check is shown below.
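
To make the composition idea concrete, here is a minimal sketch under
the assumption that is_plain_letter is simply extended to the basic
Cyrillic range (the function names are hypothetical, not necessarily
those of the attached script):

import unicodedata

def is_plain_letter(ch):
    # Hypothetical predicate: basic Latin letters plus the basic
    # Cyrillic letters, so that U+0415 (Е) qualifies as a base letter.
    return ("a" <= ch <= "z" or "A" <= ch <= "Z"
            or "\u0410" <= ch <= "\u044F")

def base_letter(ch):
    # Return the plain base letter if ch canonically decomposes into a
    # plain letter followed only by combining marks, else None.
    # (A real implementation would recurse on the decomposition.)
    decomp = unicodedata.decomposition(ch)
    if not decomp or decomp.startswith("<"):
        return None
    parts = [chr(int(h, 16)) for h in decomp.split()]
    if is_plain_letter(parts[0]) and all(unicodedata.combining(c)
                                         for c in parts[1:]):
        return parts[0]
    return None

print(base_letter("\u0401"))  # Ё recognised as Е (U+0415) + U+0308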

You are also right that we sometimes have to reason about particular
languages (and so we'd need someone with knowledge of those languages);
the same goes for some blocks, where we must decide whether including
certain characters makes sense or not. I am thinking, notably, of the
extended Latin blocks (Latin Extended-A, B, Additional, C, D, etc.),
which are still ignored.

On 11/02/2016 16:36, Teodor Sigaev wrote:
>> I don't think this alone will completely get rid of the hardcoded
>> special cases though, because we have these two mappings which look
>> like Latin but are in fact Cyrillic and I assume we need to keep them:
>>
>> Ё Е
>> ё е
>>
> As a native Russian speaker I can explain why we need to keep these two
> rules.
> 'Ё' is not an 'Е' with some accent/diacritic sign; it is a
> separate letter in the Russian alphabet. But a lot of newspapers,
> magazines and even books use 'Е' instead of 'Ё' to simplify printing
> house work. Russian speakers don't make mistakes while reading because
> 'Ё' isn't frequent and everybody remembers the right pronunciation.
> Also, on the Russian keyboard 'Ё' is placed in an inconvenient spot
> (the key with ` or ~), so many Russian writers use 'Е' instead of it to
> increase typing speed.
>
> Please do not remove at least this special case.
>
This case is now handled as a special case in the new version (see
above); the resulting rules are illustrated in the sketch below.
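
For reference, a sketch of how these two mappings end up in the rules
file (emit_rules is a hypothetical helper; the rules format is simply a
source character, whitespace, and the replacement):

import sys

# The two hand-maintained Cyrillic special cases discussed above.
SPECIAL_CASES = {"\u0401": "\u0415",  # Ё -> Е
                 "\u0451": "\u0435"}  # ё -> е

def emit_rules(mapping, out=sys.stdout):
    for src in sorted(mapping):
        out.write("%s\t%s\n" % (src, mapping[src]))

emit_rules(SPECIAL_CASES)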

Léonard Benedetti

Attachment Content-Type Size
contrib_unaccent_generate_unaccent_rules.py text/x-python 9.0 KB
unaccent.rules text/plain 6.2 KB
