Re: BUG #13440: unaccent does not remove all diacritics

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Gradek <mike(at)busbud(dot)com>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #13440: unaccent does not remove all diacritics
Date: 2015-06-16 22:01:14
Message-ID: 1790.1434492074@sss.pgh.pa.us
Lists: pgsql-bugs

Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> writes:
> On Tue, Jun 16, 2015 at 8:07 AM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> It looks like Romanian also has s with comma. Perhaps we should have
>> all these characters:
>>
>> $ curl -s http://unicode.org/Public/7.0.0/ucd/UnicodeData.txt | egrep
>> ';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' | wc -l
>> 702

> Here is an unaccent.rules file that maps those 702 characters from
> Unicode 7.0 with names like "LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH
> ..." to their base letter, plus 14 extra cases to match the existing
> unaccent.rules file. If you sort and diff this and the existing file,
> you can see that this file only adds new lines. Also, here is the
> script I used to build it from UnicodeData.txt.

Hm. The "extra cases" are pretty disturbing, because some of them sure
look like bugs; which makes me wonder how closely the unaccent.rules
file was vetted to begin with. For those following along at home,
here are Thomas' extra cases, annotated by me with the Unicode file's
description of each source character:

print_record(0x00c6, "A") # LATIN CAPITAL LETTER AE
print_record(0x00df, "S") # LATIN SMALL LETTER SHARP S
print_record(0x00e6, "a") # LATIN SMALL LETTER AE
print_record(0x0131, "i") # LATIN SMALL LETTER DOTLESS I
print_record(0x0132, "I") # LATIN CAPITAL LIGATURE IJ
print_record(0x0133, "i") # LATIN SMALL LIGATURE IJ
print_record(0x0138, "k") # LATIN SMALL LETTER KRA
print_record(0x0149, "n") # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
print_record(0x014a, "N") # LATIN CAPITAL LETTER ENG
print_record(0x014b, "n") # LATIN SMALL LETTER ENG
print_record(0x0152, "E") # LATIN CAPITAL LIGATURE OE
print_record(0x0153, "e") # LATIN SMALL LIGATURE OE
print_record(0x0401, u"\u0415") # CYRILLIC CAPITAL LETTER IO
print_record(0x0451, u"\u0435") # CYRILLIC SMALL LETTER IO

I'm really dubious that we should be translating those ligatures at
all (since the standard file is only advertised to do "unaccenting"),
and if we do translate them, shouldn't they convert to AE, ae, etc?
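
For anyone who wants to check what Unicode itself records for these fourteen characters, here is a minimal sketch using Python's standard unicodedata module. The helper name describe() is made up for illustration, and the results depend on the Unicode version bundled with your Python, not on the 7.0 file discussed here. An empty decomposition means Unicode does not treat the character as a base letter plus a mark.

```python
import unicodedata

# The fourteen "extra cases" listed above, as code points.
EXTRA_CASES = [
    0x00C6, 0x00DF, 0x00E6, 0x0131, 0x0132, 0x0133, 0x0138,
    0x0149, 0x014A, 0x014B, 0x0152, 0x0153, 0x0401, 0x0451,
]

def describe(codepoint):
    """Return (character name, decomposition field) for one code point.

    The decomposition is the raw UnicodeData.txt mapping, e.g. "0415 0308",
    or an empty string when the character has none.
    """
    ch = chr(codepoint)
    return unicodedata.name(ch), unicodedata.decomposition(ch)

for cp in EXTRA_CASES:
    name, decomp = describe(cp)
    print(f"U+{cp:04X} {name}: {decomp or '(no decomposition)'}")
```

A "<compat>" prefix in the output marks a compatibility decomposition, which is a different thing from a canonical base-plus-accent decomposition.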

Also unclear why we're dealing with KRA and ENG but not any of the
other marginal letters that Unicode labels as LATIN (what the heck
is an "AFRICAN D", for instance?)

Also, while my German is nearly nonexistent, I had the idea that sharp-S
to "S" would be considered a case-folding transformation not an accent
removal. Comments from German speakers welcome of course.
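
As a rough illustration of the case-folding point (not a statement about what unaccent does), the sketch below shows how Python's built-in string methods treat sharp-S. The helper folding_report() is a hypothetical name used only for this example.

```python
import unicodedata

SHARP_S = "\u00df"  # LATIN SMALL LETTER SHARP S

def folding_report(ch):
    """Show case mappings for `ch` and whether canonical decomposition changes it."""
    return {
        "upper": ch.upper(),
        "casefold": ch.casefold(),
        # True when NFD leaves the character alone, i.e. there is no
        # combining mark to strip from it.
        "nfd_unchanged": unicodedata.normalize("NFD", ch) == ch,
    }

print(folding_report(SHARP_S))
```

The case mappings change the character while decomposition leaves it untouched, which is consistent with treating it as a casing question and not an accent.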

Likewise dubious about those Cyrillic entries, although I suppose
Teodor probably had good reasons for including them.

On the other side of the coin, I think Thomas' regex might have swept up a
bit too much. I did this to see what sort of decorations were described:

$ egrep ';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' UnicodeData.txt | sed 's/.* WITH //' | sed 's/;.*//' | sort | uniq -c
34 ACUTE
2 ACUTE AND DOT ABOVE
4 BAR
2 BELT
12 BREVE
2 BREVE AND ACUTE
2 BREVE AND DOT BELOW
2 BREVE AND GRAVE
2 BREVE AND HOOK ABOVE
2 BREVE AND TILDE
2 BREVE BELOW
34 CARON
2 CARON AND DOT ABOVE
22 CEDILLA
2 CEDILLA AND ACUTE
2 CEDILLA AND BREVE
26 CIRCUMFLEX
6 CIRCUMFLEX AND ACUTE
6 CIRCUMFLEX AND DOT BELOW
6 CIRCUMFLEX AND GRAVE
6 CIRCUMFLEX AND HOOK ABOVE
6 CIRCUMFLEX AND TILDE
12 CIRCUMFLEX BELOW
4 COMMA BELOW
4 CROSSED-TAIL
7 CURL
8 DESCENDER
19 DIAERESIS
4 DIAERESIS AND ACUTE
2 DIAERESIS AND CARON
2 DIAERESIS AND GRAVE
6 DIAERESIS AND MACRON
2 DIAERESIS BELOW
8 DIAGONAL STROKE
39 DOT ABOVE
4 DOT ABOVE AND MACRON
38 DOT BELOW
2 DOT BELOW AND DOT ABOVE
4 DOT BELOW AND MACRON
4 DOUBLE ACUTE
2 DOUBLE BAR
12 DOUBLE GRAVE
1 DOUBLE MIDDLE TILDE
1 FISHHOOK
1 FISHHOOK AND MIDDLE TILDE
5 FLOURISH
16 GRAVE
2 HIGH STROKE
30 HOOK
12 HOOK ABOVE
1 HOOK AND TAIL
1 HOOK TAIL
4 HORN
4 HORN AND ACUTE
4 HORN AND DOT BELOW
4 HORN AND GRAVE
4 HORN AND HOOK ABOVE
4 HORN AND TILDE
12 INVERTED BREVE
1 INVERTED LAZY S
3 LEFT HOOK
17 LINE BELOW
1 LONG LEFT LEG
1 LONG LEFT LEG AND LOW RIGHT RING
1 LONG LEG
2 LONG RIGHT LEG
2 LONG STROKE OVERLAY
4 LOOP
1 LOW RIGHT RING
1 LOW RING INSIDE
14 MACRON
4 MACRON AND ACUTE
2 MACRON AND DIAERESIS
4 MACRON AND GRAVE
2 MIDDLE DOT
1 MIDDLE RING
13 MIDDLE TILDE
1 NOTCH
10 OBLIQUE STROKE
10 OGONEK
2 OGONEK AND MACRON
17 PALATAL HOOK
9 RETROFLEX HOOK
1 RETROFLEX HOOK AND BELT
1 RIGHT HALF RING
1 RIGHT HOOK
6 RING ABOVE
2 RING ABOVE AND ACUTE
2 RING BELOW
1 SERIF
2 SHORT RIGHT LEG
2 SMALL LETTER J
1 SMALL LETTER Z
2 SQUIRREL TAIL
36 STROKE
2 STROKE AND ACUTE
2 STROKE AND DIAGONAL STROKE
4 STROKE THROUGH DESCENDER
4 SWASH TAIL
3 TAIL
16 TILDE
4 TILDE AND ACUTE
2 TILDE AND DIAERESIS
2 TILDE AND MACRON
6 TILDE BELOW
4 TOPBAR
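
The same tally can be sketched in Python for those without egrep and sed handy. This is an approximation of the pipeline above, assuming the input is the semicolon-separated lines of UnicodeData.txt; the function name count_decorations() is made up here.

```python
import re
from collections import Counter

# Same pattern as the egrep command above.
NAME_PATTERN = re.compile(r";LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ")

def count_decorations(lines):
    """Count the text following 'WITH' for each matching character line."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not NAME_PATTERN.search(line):
            continue
        # Mirror sed 's/.* WITH //': greedy, so cut through the last " WITH ".
        tail = re.sub(r".* WITH ", "", line, count=1)
        # Mirror sed 's/;.*//': drop everything from the first semicolon on.
        decoration = re.sub(r";.*", "", tail, count=1)
        counts[decoration] += 1
    return counts

if __name__ == "__main__":
    import sys
    # Usage: python count_decorations.py < UnicodeData.txt
    for decoration, n in sorted(count_decorations(sys.stdin).items()):
        print(f"{n:7d} {decoration}")
```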

Do we really need to expand the rule list fivefold to get rid of things
like FISHHOOK and SQUIRREL TAIL? Is removing those sorts of things even
legitimately "unaccenting"? I dunno, but I think it would be good to
have some consensus about what we want this file to do. I'm not sure
that we should be basing the transformation on minor phrasing details
in the Unicode data file.
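
One alternative criterion, offered only as a sketch and not something proposed in this thread, would be to key off the canonical decomposition data instead of the character names: treat a character as "accented" only if it decomposes into a base character followed by nothing but combining marks. The function name strip_marks() is hypothetical, and the results depend on the Unicode version shipped with your Python.

```python
import unicodedata

def strip_marks(ch):
    """Return the base character if `ch` canonically decomposes into a base
    followed only by combining marks; otherwise return None."""
    decomposed = unicodedata.normalize("NFD", ch)
    if len(decomposed) < 2:
        return None  # no canonical decomposition at all
    base, marks = decomposed[0], decomposed[1:]
    # General categories Mn, Mc and Me are the combining marks.
    if all(unicodedata.category(m).startswith("M") for m in marks):
        return base
    return None

for ch in ["\u0219", "\u00e9", "\u00f8", "\u0401"]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {strip_marks(ch)!r}")
```

Under this rule the Romanian s with comma below maps to "s", while letters such as o with stroke, whose decoration is not encoded as a combining mark, are left alone.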

regards, tom lane
