Re: BUG #13440: unaccent does not remove all diacritics

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Gradek <mike(at)busbud(dot)com>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #13440: unaccent does not remove all diacritics
Date: 2015-06-18 19:30:46
Message-ID: 4186.1434655846@sss.pgh.pa.us
Lists: pgsql-bugs

Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> writes:
> On Wed, Jun 17, 2015 at 10:01 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I'm really dubious that we should be translating those ligatures at
>> all (since the standard file is only advertised to do "unaccenting"),
>> and if we do translate them, shouldn't they convert to AE, ae, etc?

> Perhaps these conversions are intended only for comparisons, full text
> indexing etc but not showing the converted text to a user, in which
> case it doesn't matter too much if the conversions are a bit weird
> (œuf and oeuf are interchangeable in French, but euf is nonsense).
> But can we actually change them? That could cause difficulty for
> users with existing unaccented data stored/indexed... But I suppose
> even adding new mappings could cause problems.

Yeah, if we do anything other than adding new mappings, I suspect that
part could not be back-patched. Maybe adding new mappings shouldn't
be back-patched either, though it seems relatively safe to me.

> Right, that does seem a little bit weak. Instead of making
> assumptions about the format of those names, we could make use of the
> precomposed -> composed character mappings in the file. We could look
> for characters in the "letters" category where there is decomposition
> information (ie combining characters for the individual accents) and
> the base character is [a-zA-Z]. See attached. This produces 411
> mappings (including the 14 extras). I didn't spend the time to figure
> out which 300 odd characters were dropped but I noticed that our
> Romanian characters of interest are definitely in.
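
A minimal sketch of the decomposition-based idea described above, not the attached script. It assumes Python's built-in unicodedata module rather than parsing UnicodeData.txt directly, and it follows canonical decompositions only, so ligatures are deliberately excluded and the count will not match the 411 mappings quoted. The names unaccented_base and generate_rules, and the code point limit, are hypothetical.

```python
import string
import unicodedata

ASCII_LETTERS = set(string.ascii_letters)


def unaccented_base(ch):
    """Return the ASCII letter that ch reduces to, or None if it does not qualify.

    A character qualifies when it is in a "letter" category, its full
    canonical decomposition starts with [a-zA-Z], and everything after
    the base is a combining mark.
    """
    if not unicodedata.category(ch).startswith("L"):
        return None
    # NFD applies canonical decompositions recursively (compat ones are ignored).
    decomposed = unicodedata.normalize("NFD", ch)
    if len(decomposed) < 2:
        return None  # no decomposition, or it maps to a single character
    base, marks = decomposed[0], decomposed[1:]
    if base not in ASCII_LETTERS:
        return None
    if not all(unicodedata.category(m).startswith("M") for m in marks):
        return None
    return base


def generate_rules(max_codepoint=0x2FFF):
    """Build {accented_char: base_letter} for code points up to max_codepoint.

    The upper bound is an arbitrary assumption made for this sketch.
    """
    rules = {}
    for cp in range(0x80, max_codepoint + 1):
        ch = chr(cp)
        base = unaccented_base(ch)
        if base is not None:
            rules[ch] = base
    return rules
```

With this approach the Romanian comma-below letters (U+0219, U+021B) fall out of the data automatically, while characters without decomposition information, such as U+00F8 (o with stroke), would still need manual entries.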

I took a quick look at this list and it seems fairly sane as far as the
automatically-generated items go, except that I see it hits a few
LIGATURE cases (including the existing ij cases, but also fi fl and ffl).
I'm still quite dubious that that is appropriate; at least, if we do it
I think we should be expanding out to the equivalent multi-letter form,
not simply taking one of the letters and dropping the rest. Anybody else
have an opinion on how to handle ligatures?
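
One possible way to get the "equivalent multi-letter form" is to take the compatibility decomposition instead of keeping only the first letter. The sketch below is an illustration under assumptions, not a proposal from this thread: it uses Python's unicodedata module, the function name expand_ligature is hypothetical, and it still relies on the word LIGATURE appearing in the character name.

```python
import unicodedata


def expand_ligature(ch):
    """Return the ASCII multi-letter expansion of a ligature, or None.

    Assumption: a ligature is recognised by the word LIGATURE in its
    Unicode character name, and its expansion is whatever NFKD yields.
    """
    name = unicodedata.name(ch, "")
    if "LIGATURE" not in name:
        return None
    # NFKD applies compatibility decompositions, e.g. U+FB01 -> "fi".
    expanded = unicodedata.normalize("NFKD", ch)
    if expanded != ch and expanded.isascii() and expanded.isalpha():
        return expanded
    return None
```

This covers ij, fi, fl and ffl. It does not cover U+00E6 (ae) or U+0153 (oe), because the Unicode data carries no decomposition for them, so mappings such as AE and ae would still have to be listed by hand.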

The manually added special cases don't look any saner than they did
before :-(. Anybody have an objection to removing those (except maybe
dotless i) in HEAD?

regards, tom lane
