Re: BUG #15548: Unaccent does not remove combining diacritical characters

From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Daniel Verite <daniel(at)manitou-mail(dot)org>, thomas(dot)munro(at)enterprisedb(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-15 21:03:33
Message-ID: CAAhbUMMmXnj0YSD+fr5hSqeC+D6PAG+0kXJwMMhK2DCdwQVoxQ@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Sat, 15 Dec 2018 at 14:05, Hugh Ranalli <hugh(at)whtc(dot)ca> wrote:

> On Sat, 15 Dec 2018 at 13:44, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> Hm. Something funny is going on here. When I fetch the two reference
>> files from the URLs cited in the script, and do
>>
>> python2 generate_unaccent_rules.py --unicode-data-file UnicodeData.txt
>> --latin-ascii-file Latin-ASCII.xml >newrules
>>
>> I get something that's bit-for-bit the same as what's in unaccent.rules.
>> So there's clearly a platform difference between here and there.
>>
>> I'm using Python 2.6.6, which is what ships with RHEL6; have not tried
>> it on anything newer.
>>
> Well, that's embarrassing. When I looked I couldn't see anything that
> looked platform specific. I'm on Python 2.7.6, which shipped with Mint 17.
> We use other versions of 2.7 on our production platforms. I'll take another
> look, and check the URLs I am using.
>

The problem is that I downloaded the latest version of the Latin-ASCII
transliteration file (r34, rather than the r28 specified in the URL). More
than three years ago (in r29, of course) they changed the file format
(https://unicode.org/cldr/trac/ticket/5873), and as a result
parse_cldr_latin_ascii_transliterator loads an empty rule set from the
newer file. I'd be happy to either a) support both formats, or b) support
just the newest and update the URL. Option b) is cleaner, and I can't
imagine why anyone would want to use an older rule set (then again,
struggling with Unicode always makes my head hurt; I am not an expert on
it). Thoughts?
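
For illustration only, here is a rough sketch of what option a) might look like. It is not the code in generate_unaccent_rules.py. It assumes (this message does not spell out the format change) that the older files carry one rule per tRule element and the newer ones carry every rule as separate lines inside a single tRule element. The function name parse_rules and the simplified rule pattern are hypothetical. The sketch also raises an error when nothing is parsed, so a format change cannot silently yield an empty rule set.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, assumed rule shape: "<source> → <target> ;" where the target
# may be wrapped in single quotes. The real CLDR syntax is richer than this.
RULE_RE = re.compile(r"^\s*(\S+)\s*\u2192\s*(?:'(.+)'|(\S+))\s*;")


def parse_rules(xml_text):
    """Return (source, target) pairs from a Latin-ASCII style XML string.

    Hypothetical helper: handles both layouts assumed above by splitting
    the text of every tRule element into lines.
    """
    root = ET.fromstring(xml_text)
    rules = []
    for elem in root.iter("tRule"):
        # One rule per element or many rules per element: either way,
        # each non-matching line (blank, comment) is simply skipped.
        for line in (elem.text or "").splitlines():
            match = RULE_RE.match(line)
            if match:
                rules.append((match.group(1), match.group(2) or match.group(3)))
    if not rules:
        # Fail loudly instead of generating an empty rules file.
        raise ValueError("no transliteration rules parsed; "
                         "the file format may have changed")
    return rules
```

Under those assumptions, either layout would give the same list of pairs, and a file in an unrecognised format would stop the script rather than produce an empty rules file.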
