Re: BUG #15548: Unaccent does not remove combining diacritical characters

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, hugh(at)whtc(dot)ca, Daniel Verite <daniel(at)manitou-mail(dot)org>
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2019-12-03 09:01:57
Message-ID: CA+hUKG+OG4bkwe6hn0yEBq2eY=HKuy9D_z2UgXeKjbrav7db5g@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Tue, Dec 3, 2019 at 9:57 PM Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Sun, Dec 16, 2018 at 8:20 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Hugh Ranalli <hugh(at)whtc(dot)ca> writes:
> > > The problem is that I downloaded the latest version of the Latin-ASCII
> > > transliteration file (r34 rather than the r28 specified in the URL). Over 3
> > > years ago (in r29, of course) they changed the file format (
> > > https://unicode.org/cldr/trac/ticket/5873) so that
> > > parse_cldr_latin_ascii_transliterator loads an empty rules set.
> >
> > Ah-hah.
> >
> > > I'd be
> > > happy to either a) support both formats, or b), support just the newest and
> > > update the URL. Option b) is cleaner, and I can't imagine why anyone would
> > > want to use an older rule set (then again, struggling with Unicode always
> > > makes my head hurt; I am not an expert on it). Thoughts?
> >
> > (b) seems sufficient to me, but perhaps someone else has a different
> > opinion.
> >
> > Whichever we do, I think it should be a separate patch from the feature
> > addition for combining diacriticals, just to keep the commit history
> > clear.
>
> +1 for updating to the latest file from time to time. After
> http://unicode.org/cldr/trac/ticket/11383 makes it into a new release,
> our special_cases() function will have just the two Cyrillic
> characters, which should almost certainly be handled by adding
> Cyrillic to the ranges we handle via the usual code path, and DEGREE
> CELSIUS and DEGREE FAHRENHEIT. Those degree signs could possibly be
> extracted from Unicode.txt (or we could just forget about them), and
> then we could drop special_cases().

Aha, CLDR 36 included that change, so when we update we can drop a special case.
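
For illustration only, here is a minimal Python sketch of the idea discussed in the quoted text: deriving unaccented forms from Unicode decomposition data rather than from hand-written special cases. It is not the code in our rule generator; strip_combining is a hypothetical helper, and the choice of NFKD normalization and of the sample characters (CYRILLIC SMALL LETTER IO, DEGREE CELSIUS) is an assumption made for the example.

```python
import unicodedata


def strip_combining(text):
    """Hypothetical helper: decompose text and drop combining marks.

    NFKD is used (an assumption for this sketch) so that compatibility
    characters such as DEGREE CELSIUS are expanded as well.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns a non-zero combining class for
    # combining marks and 0 for base characters.
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)


if __name__ == "__main__":
    for sample in ("\u00e9", "e\u0301", "\u0451", "\u2103"):
        result = strip_combining(sample)
        print("U+%04X -> %r" % (ord(sample[0]), result))
```

The second sample is a plain "e" followed by U+0301 COMBINING ACUTE ACCENT, which is the kind of input the original bug report is about.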
