Re: BUG #15548: Unaccent does not remove combining diacritical characters

From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: hugh(at)whtc(dot)ca,pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-13 13:19:51
Message-ID: 769f5b7c-42c6-435a-a062-a728891b7d81@manitou-mail.org
Lists: pgsql-bugs pgsql-hackers

PG Bug reporting form wrote:

> Apparently Unicode has two ways of accenting a character: as a separate code
> point, which represents the base character and the accent, or as a
> "combining diacritical mark"
> (https://en.wikipedia.org/wiki/Combining_Diacritical_Marks)

Yes. See also https://en.wikipedia.org/wiki/Unicode_equivalence

In general, PostgreSQL leaves it to applications to normalize
Unicode strings so that they are all in the same canonical form,
either composed or decomposed.
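
As an illustration of what normalizing on the application side could look
like, here is a minimal Python sketch using the standard unicodedata module.
The helper name to_canonical is made up for this example; it is not part of
PostgreSQL or of any driver.

```python
import unicodedata

def to_canonical(text, form="NFC"):
    """Return text in one canonical Unicode form (hypothetical helper)."""
    return unicodedata.normalize(form, text)

composed = "\u00C0"      # LATIN CAPITAL LETTER A WITH GRAVE, one code point
decomposed = "A\u0300"   # "A" followed by COMBINING GRAVE ACCENT

# The two strings render the same but compare as different
# until both are brought to the same canonical form.
same_before = composed == decomposed
same_after = to_canonical(composed) == to_canonical(decomposed)
```

If every string goes through such a step before being stored, the database
only ever sees one of the two representations.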

> the mark applies itself to the preceding character. For example, A
> followed by U+0300 displays À. However, unaccent is not removing
> these accents.

Short of having the input normalized by the application, ISTM that the
best solution would be to provide functions to do it in Postgres, so
you'd just write for example:
    unaccent(unicode_NFC(string))

Otherwise, unaccent.rules can be customized: you can add replacements
for the letter+diacritical sequences that are missing for the languages
you have to deal with. But doing that in general, for every diacritical
combined with every base character, seems unrealistic.
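
A rough sketch of how such extra entries might be generated, in Python. It
assumes that a rules line is the source sequence, then whitespace, then the
replacement, and that a multi-character source is accepted; check this
against the unaccent documentation for your version. The function name
extra_rules is made up for this example.

```python
import unicodedata

def extra_rules(base_letters, marks):
    """Build 'sequence<TAB>replacement' lines for letter+mark pairs.

    Hypothetical helper: the line format is an assumption, see above.
    """
    lines = []
    for base in base_letters:
        for mark in marks:
            # Only accept real combining characters as marks.
            if unicodedata.combining(mark) == 0:
                raise ValueError(f"U+{ord(mark):04X} is not a combining mark")
            lines.append(f"{base}{mark}\t{base}")
    return lines

# Grave and acute accents on a handful of vowels.
rules = extra_rules("AEae", ["\u0300", "\u0301"])
```

The number of lines is the product of the two input sizes, which is why this
only seems practical for a limited set of letters and marks.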

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
