Re: BUG #13440: unaccent does not remove all diacritics

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: mike(at)busbud(dot)com, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #13440: unaccent does not remove all diacritics
Date: 2015-06-15 04:47:01
Message-ID: CAEepm=2b1df83h68tTiuk_xGC-PVmru02+rE2xp6_Hs5q_zHSg@mail.gmail.com
Lists: pgsql-bugs

On Mon, Jun 15, 2015 at 5:59 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> mike(at)busbud(dot)com writes:
>> Sorry, I couldn't install the most recent minor release, but I did try this
>> on several different versions. I used Heroku to try a 9.4.3 build, and got
>> the same results
>
>> select 'ț' as input, unaccent('ț') as observed, 't' as expected;
>> input | observed | expected
>> -------+----------+----------
>> ț | ț | t
>> (1 row)
>
> Hm, I do see
>
> ţ t
>
> in unaccent.rules, so the transformation ought to happen. I suspect
> an encoding issue, eg your terminal window is not transmitting characters
> in the encoding Postgres thinks you're using. You did not provide any
> info about server encoding, client encoding, or client LC_xxx environment,
> so it's hard to debug from here.

The entry in unaccent.rules is t-cedilla (ţ, U+0163), used in Gagauz
and Romanian:

https://en.wiktionary.org/wiki/%C5%A3

The character in the report above is a different code point: t-comma
(ț, U+021B), used in Livonian and Romanian, but "[o]ften replaced by
Ţ / ţ (t with cedilla), especially in computing":

https://en.wiktionary.org/wiki/%C8%9B
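
One quick way to confirm that the two characters are distinct is to look
at their code points and Unicode names. This is a minimal Python sketch
using only the standard library; the helper name describe() is made up
for illustration and has nothing to do with unaccent itself:

```python
import unicodedata

T_CEDILLA = "\u0163"  # ţ, the character that has a rule in unaccent.rules
T_COMMA = "\u021b"    # ț, the character from the bug report


def describe(ch):
    """Return (code point, Unicode name, base letter after NFD) for one character."""
    # NFD splits a precomposed letter into its base letter plus combining marks,
    # so the first code point of the result is the unaccented base.
    decomposed = unicodedata.normalize("NFD", ch)
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), decomposed[0])


for ch in (T_CEDILLA, T_COMMA):
    print(describe(ch))
```

Both characters decompose to a plain "t" plus a combining mark, but
they are separate code points, so a rules-file entry for one would not
be expected to match the other.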

--
Thomas Munro
http://www.enterprisedb.com
