Re: BUG #15548: Unaccent does not remove combining diacritical characters

From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15548: Unaccent does not remove combining diacritical characters
Date: 2018-12-14 22:42:05
Message-ID: CAAhbUMNqJXTN+_vYdi5L4CLjoq9OCG29V597RKrCQ7xKsCAejA@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

I've attached a patch that removes combining diacriticals. As with Latin and
Greek letters, it uses ranges to restrict which characters it affects.
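
As a rough illustration of the range-based idea (a sketch only, not the
attached patch): the range below is assumed to be the Unicode "Combining
Diacritical Marks" block, and the names COMBINING_RANGES,
is_combining_diacritical and strip_combining are hypothetical.

```python
import unicodedata

# Assumption: only the "Combining Diacritical Marks" block (U+0300..U+036F).
# The ranges used by the actual patch may differ.
COMBINING_RANGES = [(0x0300, 0x036F)]


def is_combining_diacritical(ch):
    """Return True if the character falls inside one of the assumed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in COMBINING_RANGES)


def strip_combining(text):
    """Drop every combining diacritical, regardless of the preceding character."""
    return "".join(ch for ch in text if not is_combining_diacritical(ch))


# Decompose first so the accent becomes a separate combining character.
decomposed = unicodedata.normalize("NFD", "caf\u00e9")
print(strip_combining(decomposed))
```

Because the mark is dropped independently of context, no rule is needed for
each base-letter-plus-accent combination.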

I have not submitted a patch for unaccent.rules, because a rules file
generated by generate_unaccent_rules.py (even before my changes) would drop
a large number of the rules in the shipped file, such as the one replacing
the copyright symbol © with (C), as well as rules for some accented
characters. It's probably worth asking whether the shipped unaccent.rules
should correspond to what the shipped generation utility produces. I was
surprised to see that it didn't.
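
One way to see which shipped rules a regenerated file would lose is to compare
the two as mappings. This is a minimal sketch: it assumes the simple
"source, whitespace, target" line format, with a lone character meaning
deletion. parse_rules and missing_rules are hypothetical helpers, and the
sample strings stand in for the real files.

```python
def parse_rules(text):
    """Parse rules text into {source: target}; a lone source maps to ''."""
    rules = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        rules[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return rules


def missing_rules(shipped, generated):
    """Return the shipped rules whose source character has no generated rule."""
    return {src: trg for src, trg in shipped.items() if src not in generated}


# Illustrative sample data only, not the contents of the real files.
shipped = parse_rules("\u00a9\t(C)\n\u00c0\tA\n")
generated = parse_rules("\u00c0\tA\n")
print(missing_rules(shipped, generated))
```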

Please let me know if you see anything I need to change.

Best wishes,
Hugh

--
Hugh Ranalli
Principal Consultant
White Horse Technology Consulting
e: hugh(at)whtc(dot)ca
c: +01-416-994-7957
w: www.whtc.ca

On Thu, 13 Dec 2018 at 13:50, Hugh Ranalli <hugh(at)whtc(dot)ca> wrote:

>
>
> On Thu, 13 Dec 2018, 11:26 Daniel Verite <daniel(at)manitou-mail(dot)org wrote:
>
>> Tom Lane wrote:
>>
>> > Hm, I thought the OP's proposal was just to make unaccent drop
>> > combining diacriticals independently of context, which'd avoid the
>> > combinatorial-growth problem.
>>
>
> That's what I was thinking. Given that the accent is separate from the
> characters, simply dropping it should result in the correct unaccented
> character.
>
>>
>> In that case, this could be achieved by simply appending the
>> diacriticals themselves to unaccent.rules, since replacement of a
>> string by an empty string is already supported as a rule.
>> It doesn't seem like the current file has any of these, but from
>> https://www.postgresql.org/docs/11/unaccent.html :
>>
>> "Alternatively, if only one character is given on a line, instances
>> of that character are deleted; this is useful in languages where
>> accents are represented by separate characters"
>>
>
> Yes, I had read that in the docs, and that's the approach I planned to
> take. I'll go ahead and develop a patch, then.
>
> Best wishes,
> Hugh
>
>>

Attachment Content-Type Size
remove-combining-diacritical-accents-in-unaccent.rules.patch text/x-patch 2.5 KB
