Re: BUG #15548: Unaccent does not remove combining diacritical characters

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: hugh(at)whtc(dot)ca
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, daniel(at)manitou-mail(dot)org, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-18 04:05:00
Message-ID: CAEepm=0qb_nx-f8cACS1=1NdmCj-3D9zXFU+RJHsFbZEztcqjg@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Tue, Dec 18, 2018 at 12:03 PM Hugh Ranalli <hugh(at)whtc(dot)ca> wrote:
> On Mon, 17 Dec 2018 at 15:31, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Hugh Ranalli <hugh(at)whtc(dot)ca> writes:
>> > I've attached two patches, one to update generate_unaccent_rules.py, and
>> > another that updates unaccent.rules from the v34 transliteration file.
>>
>> I think you forgot the patches?
>
>
> Sigh, yes I did. That's what I get for trying to get this sent out before heading to an appointment. Patches attached and will add to CF. Let me know if you see anything amiss.

+ʹ '
+ʺ "
+ʻ '
+ʼ '
+ʽ '
+˂ <
+˃ >
+˄ ^
+ˆ ^
+ˈ '
+ˋ `
+ː :
+˖ +
+˗ -
+˜ ~

I don't think this is quite right. Those don't seem to be the
combining code points[1]; they look like spacing modifier letters
(U+02B0 to U+02FF) rather than combining diacritical marks (U+0300 to
U+036F). In any case they are being replaced with ASCII characters,
whereas I thought we wanted to replace them with nothing at all. Here
is my attempt to come up with a test case using combining characters:

select unaccent('un café crème s''il vous plaît');

It's not stripping the accents. I've attached that in a file for
reference so you can run it with psql -f x.sql, and you can see that
it's using combining code points (U+0301, U+0300 and U+0302, which
come out as cc81, cc80 and cc82 in UTF-8) like so:

$ xxd x.sql
00000000: 7365 6c65 6374 2075 6e61 6363 656e 7428  select unaccent(
00000010: 2775 6e20 6361 6665 cc81 2063 7265 cc80  'un cafe.. cre..
00000020: 6d65 2073 2727 696c 2076 6f75 7320 706c  me s''il vous pl
00000030: 6169 cc82 7427 293b 0a0a                 ai..t');..

(To come up with that I used the trick of typing ":%!xxd" and then
when finished ":%!xxd -r", to turn vim into a hex editor.)
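
For anyone without the attachment, the bytes in the dump above can
probably be reproduced without a hex editor. The following is a
minimal sketch in plain Python (not part of the patch, and not how
x.sql was actually made): it assumes that NFD normalization of the
precomposed query yields the same decomposed text, and the helper name
combining_marks is made up for illustration.

```python
import unicodedata

# Precomposed form of the test query, written with escapes so that the
# source is unambiguous. The two trailing newlines match the dump.
query = "select unaccent('un caf\u00e9 cr\u00e8me s''il vous pla\u00eet');\n\n"

# NFD splits each accented letter into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", query)
data = decomposed.encode("utf-8")


def combining_marks(text):
    """Return (code point, UTF-8 hex) for each combining character in text."""
    return [
        (f"U+{ord(ch):04X}", ch.encode("utf-8").hex())
        for ch in text
        if unicodedata.combining(ch)  # nonzero combining class
    ]


if __name__ == "__main__":
    print(len(data))                    # size of the encoded query in bytes
    print(combining_marks(decomposed))  # the marks and their UTF-8 encodings
```

Writing data to a file in binary mode should give something
equivalent to x.sql for use with psql -f.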

[1] https://en.wikipedia.org/wiki/Combining_Diacritical_Marks
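
To illustrate the distinction, here is a rough Python sketch that
checks whether the characters from the hunk quoted earlier fall in the
Combining Diacritical Marks block, and shows what "replace them with
nothing" would mean for the test string. It is only an illustration of
the expected behaviour, not the unaccent extension's code or anything
from generate_unaccent_rules.py; both function names are hypothetical.

```python
# Characters from the quoted unaccent.rules hunk.
hunk_chars = "ʹʺʻʼʽ˂˃˄ˆˈˋː˖˗˜"


def is_combining_diacritic(ch):
    """True if ch lies in the Combining Diacritical Marks block (U+0300-U+036F)."""
    return 0x0300 <= ord(ch) <= 0x036F


def strip_combining(text):
    """Drop combining diacritical marks, replacing them with nothing."""
    return "".join(ch for ch in text if not is_combining_diacritic(ch))


if __name__ == "__main__":
    # List each hunk character with its code point and block membership.
    for ch in hunk_chars:
        print(f"U+{ord(ch):04X}", is_combining_diacritic(ch))
    # Decomposed test string: base letters followed by U+0301, U+0300, U+0302.
    print(strip_combining("un cafe\u0301 cre\u0300me s'il vous plai\u0302t"))
```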

--
Thomas Munro
http://www.enterprisedb.com

Attachment: x.sql (application/octet-stream, 58 bytes)
