Re: case insensitive match in unicode

From: "Mike Rylander" <mrylander(at)gmail(dot)com>
To: SunWuKung <Balazs(dot)Klein(at)axelero(dot)hu>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: case insensitive match in unicode
Date: 2006-04-07 14:41:15
Message-ID: b918cf3d0604070741g2d30dc16iccf065ac539682c7@mail.gmail.com
Lists: pgsql-general

On 4/6/06, SunWuKung <Balazs(dot)Klein(at)axelero(dot)hu> wrote:
> In article <20060327114037(dot)GD30791(at)svana(dot)org>, kleptog(at)svana(dot)org says...
> > On Mon, Mar 27, 2006 at 12:45:05PM +0200, SunWuKung wrote:
> > > This sounds like a very interesting concept.
> > > It wouldn't be 'case insensitive' just insensitive.
> > >
> > > The way I imagine it now is a special case of the ~ function.
> > > I create matchgroups in a table and check each character if it is in the
> > > group. If it is I will replace the character with the group in [éÉE],
> > > [oóOÓ??] and do a regexp with that.
> >
> > No need to reinvent the wheel. ICU provides a range of services to deal
> > with this. For example the following filter in ICU:
> >
> > NFD; [:Nonspacing Mark:] Remove; NFC.
> >
> > Will remove all accents from characters. And it works for all Unicode
> > characters. With a bit more thinking you can work with case variations
> > also.
> >
> > There is also a locale-independent case-mapping module there, plus
> > various locale-specific ones.
> >
> > http://icu.sourceforge.net/userguide/Transform.html
> > http://icu.sourceforge.net/userguide/caseMappings.html
> > http://icu.sourceforge.net/userguide/normalization.html
> >
> > Have a nice day,
> >
> Thanks, I looked at this and it looks like something that would indeed
> solve the problem.
> However, so far I have been unable to figure out how I could use this
> from within Postgres. If you have experience with it, could you give me
> an example?

I was looking into creating a Pg function wrapper for some of the ICU
stuff but, to be perfectly honest, I couldn't find an actual API
reference for ICU.

In any case, you can do this with PL/Perl:

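-- Needs the untrusted language (plperlu): trusted plperl cannot "use" modules.
CREATE FUNCTION strip_nonspacing_marks ( text ) RETURNS text AS $func$
    use Unicode::Normalize;
    use Encode;

    my $string = shift;
    # Decode only if the argument arrived as raw UTF-8 bytes rather than
    # as Perl characters (assumes a UTF8-encoded database).
    $string = decode( 'UTF-8', $string ) unless utf8::is_utf8($string);

    # Decompose, then drop every nonspacing mark (general category Mn).
    $string = NFD($string);
    $string =~ s/\p{Mn}+//g;

    # Recompose whatever is left.
    return NFC($string);
$func$ LANGUAGE plperlu STRICT;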

It's untested and won't be as fast as ICU, but it should get the job
done. Hope it helps!
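
If you want to check what the NFD / remove nonspacing marks / NFC
pipeline produces before wiring it into the database, here is a rough
sketch of the same steps in Python using only the standard unicodedata
module. The function names are mine, not ICU's, and insensitive_key is
just one guess at how you might combine accent stripping with case
folding to get the "insensitive" comparison you described.

```python
import unicodedata


def strip_nonspacing_marks(text):
    """Remove accents: decompose (NFD), drop category Mn, recompose (NFC)."""
    decomposed = unicodedata.normalize("NFD", text)
    # After NFD each accent is a separate combining character whose
    # Unicode general category is Mn (Mark, nonspacing).
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


def insensitive_key(text):
    """Hypothetical helper: accent-stripped, case-folded comparison key."""
    return strip_nonspacing_marks(text).casefold()


if __name__ == "__main__":
    print(strip_nonspacing_marks("Balázs"))
    print(insensitive_key("É") == insensitive_key("e"))
```

Comparing insensitive_key(a) with insensitive_key(b) treats all the
members of a group like [éÉE] as equal without having to maintain the
groups by hand.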

>
> Thanks
> Balázs

--
Mike Rylander
mrylander(at)gmail(dot)com
GPLS -- PINES Development
Database Developer
http://open-ils.org
