Re: BUG #13440: unaccent does not remove all diacritics

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Gradek <mike(at)busbud(dot)com>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #13440: unaccent does not remove all diacritics
Date: 2015-06-17 01:25:53
Message-ID: CAEepm=2yw0so0ke8ZRy-qWOCrPRC2Ts0cs_6O2Zudkg=R+sR9Q@mail.gmail.com
Lists: pgsql-bugs

On Wed, Jun 17, 2015 at 10:01 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> writes:
>> Here is an unaccent.rules file that maps those 702 characters from
>> Unicode 7.0 with names like "LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH
>> ..." to their base letter, plus 14 extra cases to match the existing
>> unaccent.rules file. If you sort and diff this and the existing file,
>> you can see that this file only adds new lines. Also, here is the
>> script I used to build it from UnicodeData.txt.
>
> Hm. The "extra cases" are pretty disturbing, because some of them sure
> look like bugs; which makes me wonder how closely the unaccent.rules
> file was vetted to begin with. For those following along at home,
> here are Thomas' extra cases, annotated by me with the Unicode file's
> description of each source character:
>
> print_record(0x00c6, "A") # LATIN CAPITAL LETTER AE
> print_record(0x00df, "S") # LATIN SMALL LETTER SHARP S
> print_record(0x00e6, "a") # LATIN SMALL LETTER AE
> print_record(0x0131, "i") # LATIN SMALL LETTER DOTLESS I
> print_record(0x0132, "I") # LATIN CAPITAL LIGATURE IJ
> print_record(0x0133, "i") # LATIN SMALL LIGATURE IJ
> print_record(0x0138, "k") # LATIN SMALL LETTER KRA
> print_record(0x0149, "n") # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
> print_record(0x014a, "N") # LATIN CAPITAL LETTER ENG
> print_record(0x014b, "n") # LATIN SMALL LETTER ENG
> print_record(0x0152, "E") # LATIN CAPITAL LIGATURE OE
> print_record(0x0153, "e") # LATIN SMALL LIGATURE OE
> print_record(0x0401, u"\u0415") # CYRILLIC CAPITAL LETTER IO
> print_record(0x0451, u"\u0435") # CYRILLIC SMALL LETTER IO
>
> I'm really dubious that we should be translating those ligatures at
> all (since the standard file is only advertised to do "unaccenting"),
> and if we do translate them, shouldn't they convert to AE, ae, etc?

Perhaps these conversions are intended only for comparisons, full-text
indexing and the like, not for showing the converted text to a user.
In that case it doesn't matter too much if the conversions are a bit
weird (œuf and oeuf are interchangeable in French, but euf is
nonsense). But can we actually change them? That could cause
difficulty for users who already have unaccented data stored or
indexed. I suppose even adding new mappings could cause the same
problem.

> Also unclear why we're dealing with KRA and ENG but not any of the
> other marginal letters that Unicode labels as LATIN (what the heck
> is an "AFRICAN D", for instance?)
>
> Also, while my German is nearly nonexistent, I had the idea that sharp-S
> to "S" would be considered a case-folding transformation not an accent
> removal. Comments from German speakers welcome of course.
>
> Likewise dubious about those Cyrillic entries, although I suppose
> Teodor probably had good reasons for including them.
>
> On the other side of the coin, I think Thomas' regex might have swept up a
> bit too much. I did this to see what sort of decorations were described:
>
> $ egrep ';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' UnicodeData.txt | sed 's/.* WITH //' | sed 's/;.*//' | sort | uniq -c
> 34 ACUTE
> ...snip...
> 4 TOPBAR
>
> Do we really need to expand the rule list fivefold to get rid of things
> like FISHHOOK and SQUIRREL TAIL? Is removing those sorts of things even
> legitimately "unaccenting"? I dunno, but I think it would be good to
> have some consensus about what we want this file to do. I'm not sure
> that we should be basing the transformation on minor phrasing details
> in the Unicode data file.

Right, that does seem a little bit weak. Instead of making
assumptions about the format of those names, we could make use of the
precomposed -> decomposed character mappings in the file. We could
look for characters in the "letters" category where there is
decomposition information (i.e. a base character followed by combining
characters for the individual accents) and the base character is
[a-zA-Z]. See attached. This produces 411 mappings (including the 14
extras). I didn't spend the time to figure out which 300-odd
characters were dropped, but I noticed that our Romanian characters of
interest are definitely in.
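
For readers following along, here is a minimal sketch of the
decomposition-based idea. It is not the attached script: it queries
Python's standard unicodedata module instead of parsing
UnicodeData.txt, so the result depends on the Unicode version bundled
with the Python build and will not necessarily match the 411 figure
above. The helper names base_letter and build_rules are hypothetical,
and skipping compatibility decompositions (the ones tagged like
<compat>) is an assumption of this sketch.

```python
import unicodedata

ASCII_LETTERS = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")


def base_letter(ch):
    """Return the [a-zA-Z] base of ch, or None if ch does not qualify.

    A character qualifies when it is in a letter category, has a
    canonical decomposition, and fully decomposes to one ASCII letter
    followed only by combining marks.
    """
    if not unicodedata.category(ch).startswith("L"):
        return None
    decomp = unicodedata.decomposition(ch)
    if not decomp or decomp.startswith("<"):
        # No decomposition at all, or a compatibility decomposition
        # (assumption of this sketch: those are left alone).
        return None
    full = unicodedata.normalize("NFD", ch)
    base, marks = full[0], full[1:]
    if base in ASCII_LETTERS and marks and all(
        unicodedata.combining(m) for m in marks
    ):
        return base
    return None


def build_rules(limit=0x10000):
    """Map each qualifying character below `limit` to its base letter."""
    rules = {}
    for codepoint in range(limit):
        ch = chr(codepoint)
        base = base_letter(ch)
        if base is not None:
            rules[ch] = base
    return rules


if __name__ == "__main__":
    # Emit "source<TAB>target" lines, one mapping per line.
    for src, dst in sorted(build_rules().items()):
        print(f"{src}\t{dst}")
```

Under these assumptions the Romanian letters with comma below (U+0219,
U+021B) qualify, while AE, sharp S and the other "extra cases" do not,
because they carry no canonical decomposition; they would still have
to be listed by hand.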

(There is a separate can of worms here about whether to deal with
decomposed text...)
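
To make that concern concrete, here is a small illustration using a
hypothetical one-entry rule table and a hypothetical apply_rules
helper. It is a Python model of a per-character lookup, not the
unaccent module's code: a rule keyed on a precomposed character does
not match the same text in decomposed (NFD) form unless the input is
normalized to NFC first.

```python
import unicodedata

# Hypothetical rule table: precomposed e-acute maps to plain "e".
rules = {"\u00e9": "e"}


def apply_rules(text, rules):
    """Replace each character that has a rule; leave the rest alone."""
    return "".join(rules.get(ch, ch) for ch in text)


precomposed = "caf\u00e9"                               # 4 code points
decomposed = unicodedata.normalize("NFD", precomposed)  # "cafe" + U+0301

# The precomposed form matches the rule.
hit = apply_rules(precomposed, rules)

# The decomposed form passes through with its combining accent intact.
miss = apply_rules(decomposed, rules)

# Normalizing to NFC before the lookup restores the match.
fixed = apply_rules(unicodedata.normalize("NFC", decomposed), rules)
```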

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
make_rules_decompose.py text/x-python-script 2.0 KB
