Re: BUG #13440: unaccent does not remove all diacritics

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Gradek <mike(at)busbud(dot)com>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #13440: unaccent does not remove all diacritics
Date: 2015-06-23 01:00:43
Message-ID: CAEepm=1KRVinFtuDao4L+qSBh4T4k3z996EwD5-zgytu4Qa5Fw@mail.gmail.com
Lists: pgsql-bugs

On Fri, Jun 19, 2015 at 2:00 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Fri, Jun 19, 2015 at 7:30 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> writes:
>>> On Wed, Jun 17, 2015 at 10:01 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>>> I'm really dubious that we should be translating those ligatures at
>>>> all (since the standard file is only advertised to do "unaccenting"),
>>>> and if we do translate them, shouldn't they convert to AE, ae, etc?
>>
>>> Perhaps these conversions are intended only for comparisons, full text
>>> indexing etc but not showing the converted text to a user, in which
>>> case it doesn't matter too much if the conversions are a bit weird
>>> (œuf and oeuf are interchangeable in French, but euf is nonsense).
>>> But can we actually change them? That could cause difficulty for
>>> users with existing unaccented data stored/indexed... But I suppose
>>> even adding new mappings could cause problems.
>>
>> Yeah, if we do anything other than adding new mappings, I suspect that
>> part could not be back-patched. Maybe adding new mappings shouldn't
>> be back-patched either, though it seems relatively safe to me.
>>
>>> Right, that does seem a little bit weak. Instead of making
>>> assumptions about the format of those names, we could make use of the
>>> precomposed -> composed character mappings in the file. We could look
>>> for characters in the "letters" category where there is decomposition
>>> information (ie combining characters for the individual accents) and
>>> the base character is [a-zA-Z]. See attached. This produces 411
>>> mappings (including the 14 extras). I didn't spend the time to figure
>>> out which 300 odd characters were dropped but I noticed that our
>>> Romanian characters of interest are definitely in.
>>
>> I took a quick look at this list and it seems fairly sane as far as the
>> automatically-generated items go, except that I see it hits a few
>> LIGATURE cases (including the existing ĳ cases, but also ﬁ ﬂ and ﬄ).
>> I'm still quite dubious that that is appropriate; at least, if we do it
>> I think we should be expanding out to the equivalent multi-letter form,
>> not simply taking one of the letters and dropping the rest. Anybody else
>> have an opinion on how to handle ligatures?
>
> Here is a version that optionally expands ligatures if asked to with
> --expand-ligatures.

I looked at this again and noticed a few problems. I've attached a
new version. Here is a summary of the changes compared to what is in
master:

* 6 existing ligatures expanded fully: Æ, æ, Ĳ, ĳ, Œ, œ
* 18 new ligatures added: Ǆ, ǅ, ǆ, Ǉ, ǈ, ǉ, Ǌ, ǋ, ǌ, Ǳ, ǲ, ǳ, ﬀ, ﬁ, ﬂ, ﬃ, ﬄ, ﬆ
* ß expanded to ss instead of S
* ŉ expanded to 'n instead of n
* 5 existing characters that involve neither diacritic marks[1] nor
ligatures dropped: ĸ, Ŀ, ŀ, Ŋ, ŋ
* 213 new characters with diacritics added: Ơ, ơ, Ư, ư, Ǎ, ǎ, Ǐ, ǐ, Ǒ,
ǒ, Ǔ, ǔ, Ǧ, ǧ, Ǩ, ǩ, Ǫ, ǫ, ǰ, Ǵ, ǵ, Ǹ, ǹ, Ȁ, ȁ, Ȃ, ȃ, Ȅ, ȅ, Ȇ, ȇ, Ȉ,
ȉ, Ȋ, ȋ, Ȍ, ȍ, Ȏ, ȏ, Ȑ, ȑ, Ȓ, ȓ, Ȕ, ȕ, Ȗ, ȗ, Ș, ș, Ț, ț, Ȟ, ȟ, Ȧ, ȧ,
Ȩ, ȩ, Ȯ, ȯ, Ȳ, ȳ, Ḁ, ḁ, Ḃ, ḃ, Ḅ, ḅ, Ḇ, ḇ, Ḋ, ḋ, Ḍ, ḍ, Ḏ, ḏ, Ḑ, ḑ, Ḓ,
ḓ, Ḙ, ḙ, Ḛ, ḛ, Ḟ, ḟ, Ḡ, ḡ, Ḣ, ḣ, Ḥ, ḥ, Ḧ, ḧ, Ḩ, ḩ, Ḫ, ḫ, Ḭ, ḭ, Ḱ, ḱ,
Ḳ, ḳ, Ḵ, ḵ, Ḷ, ḷ, Ḻ, ḻ, Ḽ, ḽ, Ḿ, ḿ, Ṁ, ṁ, Ṃ, ṃ, Ṅ, ṅ, Ṇ, ṇ, Ṉ, ṉ, Ṋ,
ṋ, Ṕ, ṕ, Ṗ, ṗ, Ṙ, ṙ, Ṛ, ṛ, Ṟ, ṟ, Ṡ, ṡ, Ṣ, ṣ, Ṫ, ṫ, Ṭ, ṭ, Ṯ, ṯ, Ṱ, ṱ,
Ṳ, ṳ, Ṵ, ṵ, Ṷ, ṷ, Ṽ, ṽ, Ṿ, ṿ, Ẁ, ẁ, Ẃ, ẃ, Ẅ, ẅ, Ẇ, ẇ, Ẉ, ẉ, Ẋ, ẋ, Ẍ,
ẍ, Ẏ, ẏ, Ẑ, ẑ, Ẓ, ẓ, Ẕ, ẕ, ẖ, ẗ, ẘ, ẙ, Ạ, ạ, Ả, ả, Ẹ, ẹ, Ẻ, ẻ, Ẽ, ẽ,
Ỉ, ỉ, Ị, ị, Ọ, ọ, Ỏ, ỏ, Ụ, ụ, Ủ, ủ, Ỳ, ỳ, Ỵ, ỵ, Ỷ, ỷ, Ỹ, ỹ
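
For illustration, the kind of derivation summarized above can be
sketched with Python's standard unicodedata module. This is not the
attached make_rules_v4.py; the function names and the expand_ligatures
parameter are hypothetical, and it only approximates the approach
described in this thread (letters whose decomposition is an ASCII base
letter plus combining marks, with ligatures optionally expanded).

```python
import unicodedata


def is_mark(ch):
    # Unicode general categories Mn, Mc and Me are combining marks.
    return unicodedata.category(ch).startswith("M")


def is_ascii_letter(ch):
    return "a" <= ch <= "z" or "A" <= ch <= "Z"


def plain_replacement(ch, expand_ligatures=True):
    """Return an ASCII replacement for a single character, or None.

    Hypothetical helper: an approximation of the idea discussed here,
    not the logic of the attached script.
    """
    if not unicodedata.category(ch).startswith("L"):
        return None
    # Canonical decomposition: a base letter followed by combining marks.
    nfd = unicodedata.normalize("NFD", ch)
    if len(nfd) > 1:
        base, marks = nfd[0], nfd[1:]
        if is_ascii_letter(base) and all(is_mark(m) for m in marks):
            return base
        return None
    # Compatibility decomposition covers ligatures; drop leftover marks.
    if expand_ligatures:
        nfkd = unicodedata.normalize("NFKD", ch)
        letters = [c for c in nfkd if not is_mark(c)]
        if len(letters) > 1 and all(is_ascii_letter(c) for c in letters):
            return "".join(letters)
    return None


# U+0219 s with comma below, U+FB01 fi ligature, U+01C4 DZ with caron,
# U+0140 l with middle dot, U+00F8 o with stroke.
for ch in ["\u0219", "\ufb01", "\u01c4", "\u0140", "\u00f8"]:
    print("U+%04X" % ord(ch), unicodedata.name(ch), "->", plain_replacement(ch))
```

Characters with no decomposition at all (Æ, Œ, ß, ø and friends) come
back as None from a sketch like this, so they would still need
hand-maintained entries.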

In the previous version I'd missed the LATIN ... WITH STROKE
characters like ø and ł because they aren't treated as diacritics or
ligatures in the Unicode decomposition data (they're just separate
letters, but they have an obvious unadorned ASCII replacement letter
and we already handle these). There may be a case for replacing ø
with oe[2] but that's not what we do now. Can any Danish or Norwegian
speakers comment on this? There are actually 36 characters with names
matching /LATIN (CAPITAL|SMALL) LETTER [A-Z] WITH STROKE/, but I added
only the ones that we already had, namely O, D, H, L and their
lower-case equivalents. Many of the rest seem to be obscure,
specialised characters not used in real languages.
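
As a rough way to enumerate those names, here is a sketch using the
character names that ship with Python's unicodedata module. The
function name is hypothetical, anchoring the pattern with fullmatch
(so that names like "... WITH STROKE AND ACUTE" are skipped) is my
assumption, and the number of matches depends on the Unicode version
bundled with the interpreter, so it need not equal 36.

```python
import re
import sys
import unicodedata

STROKE_NAME = re.compile(r"LATIN (CAPITAL|SMALL) LETTER ([A-Z]) WITH STROKE")


def stroke_letters():
    """Map each LATIN ... LETTER X WITH STROKE character to plain X or x."""
    found = {}
    for codepoint in range(sys.maxunicode + 1):
        ch = chr(codepoint)
        # Unnamed code points yield the default "", which never matches.
        match = STROKE_NAME.fullmatch(unicodedata.name(ch, ""))
        if match:
            case, letter = match.groups()
            found[ch] = letter if case == "CAPITAL" else letter.lower()
    return found


letters = stroke_letters()
print(len(letters), "characters matched")
for ch in sorted(letters):
    print("U+%04X" % ord(ch), ch, "->", letters[ch])
```

Filtering that output down to O, D, H, L and their lower-case forms
would reproduce the subset kept in the rules.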

I don't see why we would take out the Cyrillic character ё: it seems
like a totally legitimate case[3]. Even though it doesn't fit in with
the idea that some might have of unaccent as the
"make-this-into-plain-ASCII" function, there doesn't seem to be any
reason why we shouldn't be able to handle Latin, Cyrillic and (if
someone with the knowledge wants to add them) Greek characters in the
same rule file -- they are non-overlapping, and all have diacritic
marks which can be stripped to give a basic character set. That seems
pretty useful for text-search-type applications, which is what this
feature is for AFAIK.
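
To make the script-independence point concrete, here is a minimal
sketch (again using only Python's unicodedata; strip_marks is a
hypothetical name, and this is not how the unaccent rule file is
applied) that strips combining marks regardless of script:

```python
import unicodedata


def strip_marks(text):
    """Decompose, drop combining marks, then recompose what is left."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = [c for c in decomposed if not unicodedata.category(c).startswith("M")]
    return unicodedata.normalize("NFC", "".join(kept))


# U+0451 Cyrillic io, U+03AC Greek alpha with tonos, U+0219 Latin s with comma.
for ch in ["\u0451", "\u03ac", "\u0219"]:
    print("U+%04X -> U+%04X" % (ord(ch), ord(strip_marks(ch))))
```

A blanket approach like this also turns Cyrillic й (U+0439) into и,
because й decomposes to и plus a breve, so a curated rule file remains
the place to decide which mappings are actually wanted.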

[1] Ŀ/ŀ is an L combining with punctuation, not a mark, according to
Unicode, and generally doesn't seem to be used in any language (unlike
ŉ/'n, which is a common word in Afrikaans).
[2] https://en.wikipedia.org/wiki/%C3%98 'In other languages that do
not have the letter as part of the regular alphabet, or in limited
character sets such as ASCII, ø is frequently replaced with the
two-letter combination "oe".'
[3] https://en.wiktionary.org/wiki/%D1%91 'This letter invariably
bears the word stress. However, the diaeresis is usually not used
outside of dictionaries and children’s books, where the letter is
usually written simply as е.'

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
make_rules_v4.py text/x-python-script 4.8 KB
unaccent.rules application/octet-stream 2.2 KB
