Re: BUG #13440: unaccent does not remove all diacritics

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Michael Gradek <mike(at)busbud(dot)com>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #13440: unaccent does not remove all diacritics
Date: 2015-06-18 21:17:22
Message-ID: 20150618211722.GJ133018@postgresql.org
Lists: pgsql-bugs

Tom Lane wrote:
> Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> writes:
> > On Wed, Jun 17, 2015 at 10:01 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> I'm really dubious that we should be translating those ligatures at
> >> all (since the standard file is only advertised to do "unaccenting"),
> >> and if we do translate them, shouldn't they convert to AE, ae, etc?
>
> > Perhaps these conversions are intended only for comparisons, full text
> > indexing etc but not showing the converted text to a user, in which
> > case it doesn't matter too much if the conversions are a bit weird
> > (œuf and oeuf are interchangeable in French, but euf is nonsense).
> > But can we actually change them? That could cause difficulty for
> > users with existing unaccented data stored/indexed... But I suppose
> > even adding new mappings could cause problems.
>
> Yeah, if we do anything other than adding new mappings, I suspect that
> part could not be back-patched. Maybe adding new mappings shouldn't
> be back-patched either, though it seems relatively safe to me.

To me, conceptually what unaccent does is turn whatever junk you have
into a very basic common alphabet (ASCII); then it's very easy to do
full text searches without having to worry about what accents the people
did or did not use in their searches. If we say "okay, but that funny
char is not an accent so let's leave it alone" then the charter doesn't
sound so useful to me.
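
To make that idea concrete, here is a rough sketch in Python of "fold everything to a basic alphabet". It is only an illustration and not how the unaccent module is implemented (the module looks characters up in its rules file). The EXTRA_RULES table and the to_basic_alphabet name are made up for this example, and the mappings chosen for the ligatures are an assumption, not what unaccent.rules contains.

```python
import unicodedata

# Hypothetical mappings for characters that do NOT decompose into
# "base letter + combining mark", so stripping accents alone misses them.
EXTRA_RULES = {
    "Æ": "AE", "æ": "ae",
    "Œ": "OE", "œ": "oe",
    "ß": "ss",
    "Ø": "O", "ø": "o",
}


def to_basic_alphabet(text: str) -> str:
    """Fold accented letters and a few ligatures to plain ASCII letters."""
    out = []
    for ch in text:
        # Explicit table first: ligatures and similar "funny chars".
        if ch in EXTRA_RULES:
            out.append(EXTRA_RULES[ch])
            continue
        # NFD splits e.g. "é" into "e" + U+0301; drop the combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


if __name__ == "__main__":
    for word in ["œuf", "Álvaro", "straße", "señor"]:
        print(word, "->", to_basic_alphabet(word))
```

With a table like this, a search for "oeuf" and a search for "œuf" fold to the same string, which is the whole point for full text search. Leaving the ligatures out of the table would keep the two spellings distinct.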

The cases I care about are okay anyway, because all the funny chars in
Spanish are already covered; and maybe German people always enter their
queries using the funny ss thing I can't even write, and then this is
not a problem for them.

Regarding back-patching unaccent.rules changes as discussed downthread,
I think it's okay to simply document that any indexes using the module
should be reindexed immediately after upgrading to that minor version.
The consequence of not doing so is not *that* serious anyway. But then,
since I'm not actually affected in any way, I'm not strongly holding
this position either.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
