Re: Patch for collation using ICU

From: "John Hansen" <john(at)geeknet(dot)com(dot)au>
To: "Palle Girgensohn" <girgen(at)pingpong(dot)net>, <pgsql-hackers(at)postgresql(dot)org>
Cc: "Andrew Dunstan" <andrew(at)dunslane(dot)net>
Subject: Re: Patch for collation using ICU
Date: 2005-03-26 02:59:19
Message-ID: 5066E5A966339E42AA04BA10BA706AE5627D@rodrick.geeknet.com.au
Lists: pgsql-hackers

> -----Original Message-----
> From: Palle Girgensohn [mailto:girgen(at)pingpong(dot)net]
> Sent: Saturday, March 26, 2005 1:10 PM
> To: pgsql-hackers(at)postgresql(dot)org
> Cc: John Hansen; Andrew Dunstan
> Subject: Re: [HACKERS] Patch for collation using ICU
>
> --On fredag, mars 25, 2005 00.40.04 +0100 Palle Girgensohn
> <girgen(at)pingpong(dot)net> wrote:
>
> > Hi!
> >
> > I've put together a patch for using IBM's ICU package for collation.
> >
> > If your OS does not have full support for collation or
> > uppercase/lowercase in multibyte locales, this might be useful. If you
> > are using a multibyte character encoding in your database and want
> > collation, i.e. order by, and also lower(), upper() and initcap() to
> > work properly, this patch will do just that.
> >
> > This patch is needed for FreeBSD, since this OS has no support for
> > collation of for example unicode locales (that is, wcscoll(3) does not
> > do what you expect if you set LC_ALL=sv_SE.UTF-8, for example). AFAIK
> > the patch is *not* necessary for Linux, although IBM claims ICU
> > collation to be about twice as fast as glibc for simple western locales.
> >
> > It adds a configure switch, `--with-icu', which will set up the code
> > to use ICU instead of wchar_t and wcscoll.
> >
> > This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable, where it
> > seems to run well. I've not had the time to do any comparative
> > performance tests yet, but it seems it is at least not slower than
> > using LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster.
> >
> > I'd be delighted if some more experienced postgresql hackers would
> > review this stuff. The patch is pretty compact, so it's fast reading
> > :) I'm planning to add this patch as an option (tagged
> > "experimental") to FreeBSD's postgresql port. Any ideas about whether
> > this is a good idea or not?
> >
> > Any thoughts or ideas are welcome!
> >
> > Cheers,
> > Palle
> >
> > Patch at:
> >
> > <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-14.diff>
> >
> > ICU at sourceforge: <http://icu.sf.net/>
>
>
> Hi!
>
> There's a new patch to fix some reported problems.
>
> <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-26.diff>
>
> This version uses the DatabaseEncoding and sets the ICU
> encoding at the same time. I had to create a conversion table
> from PostgreSQL's own, somewhat odd and non-standard, names
> of encodings, into the preferred IANA names. One or two of the
> more odd ones might be slightly incorrect, hopefully not too
> far off anyway?
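
For readers who want a picture of what such a conversion table does, here is a minimal sketch in Python. It is not the table from the patch: the dictionary name, the function name and the handful of entries are assumptions chosen for illustration, and the real patch is C code that may map these names differently.

```python
# Hypothetical lookup from PostgreSQL encoding names to IANA charset names.
# The entries are illustrative only; they are NOT copied from the patch.
PG_TO_IANA = {
    "UNICODE": "UTF-8",
    "LATIN1": "ISO-8859-1",
    "LATIN2": "ISO-8859-2",
    "EUC_JP": "EUC-JP",
    "EUC_KR": "EUC-KR",
    "KOI8": "KOI8-R",
}


def iana_name(pg_encoding):
    """Return the IANA name for a PostgreSQL encoding name, or None if unmapped."""
    # Lookup is case-insensitive so that "latin1" and "LATIN1" behave alike.
    return PG_TO_IANA.get(pg_encoding.upper())
```

Returning None for an unmapped name leaves it to the caller to decide whether to fall back to the non-ICU code path or to report an error.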
>
> I've noticed a couple of things about using the ICU patch vs. pristine
> pg-8.0.1:
>
> - ORDER BY is case insensitive when using ICU. This might
> break the SQL standard (?), but sure is nice :)

This would mean that indexes are also case insensitive, right?
Which makes it a Bad Thing(tm).
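
The question really comes down to whether the comparison function ever returns "equal" for two strings that differ only in case. The sketch below contrasts the two possibilities in Python. Both comparison functions are hypothetical stand-ins, not ICU calls, and nothing in this thread establishes which of the two behaviours the patch actually has.

```python
from functools import cmp_to_key


def compare_ignoring_case(a, b):
    """Hypothetical collation that ignores case completely."""
    ka, kb = a.casefold(), b.casefold()
    return (ka > kb) - (ka < kb)


def compare_case_as_tiebreak(a, b):
    """Hypothetical collation where case only breaks ties."""
    primary = compare_ignoring_case(a, b)
    if primary != 0:
        return primary
    # The strings differ at most in case: fall back to code point order.
    return (a > b) - (a < b)


words = ["banana", "Apple", "apple", "Banana"]

# Upper and lower case are interleaved, yet no two distinct strings compare equal.
interleaved = sorted(words, key=cmp_to_key(compare_case_as_tiebreak))
```

With the first function, 'apple' and 'Apple' compare equal, so an index built on it would treat them as the same key. With the second, the sort order looks case insensitive but the two strings are still distinct keys.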

> - When the database is initialized using the C locale,
> upper() and lower() normally do not work at all for
> non-ASCII characters even if the database's encoding is say
> LATIN1 or UNICODE. (does not work for me anyway, on FreeBSD,
> and this is probably correct since the locale is still `C', I
> believe?). The ICU patch changes nothing for the LATIN1 case,
> since it does not act on single byte encodings, but for the
> UNICODE representation, it works and does what I expect it
> to, namely upper() and lower() neatly
> upper- or lowercase diacritical characters, i.e. lower('ÅÄÖ')
> -> 'åäö'.
> This is a good thing, although I'm surprised that upper/lower
> is dragged along with the LC_COLLATE fixation at initdb. I
> never run initdb in the C locale, but only now do I realize
> how broken that really is if you need to store anything else
> than English :-)

That is what I would have expected. However, it probably won't work for
the more exotic cases, like the Turkish I, whose case mapping depends on
the locale.
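
To make the distinction concrete, here is a rough Python sketch of three behaviours: C-locale-style lowering that touches only A-Z, locale-independent Unicode lowering, and a lowering that special-cases the Turkish dotted and dotless I. The function names, the language-code parameter and the translation table are assumptions made up for this illustration; they are not PostgreSQL or ICU APIs.

```python
# Turkic special cases: 'I' lowers to dotless i (U+0131),
# and dotted capital I (U+0130) lowers to plain 'i'.
TURKIC_LOWER = {ord("I"): "\u0131", ord("\u0130"): "i"}


def c_locale_lower(text):
    """Mimic lowering in the C locale: only ASCII A-Z are changed."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def lower_for_language(text, language):
    """Lowercase text, applying the Turkic I rules for 'tr' and 'az'."""
    if language in ("tr", "az"):
        text = text.translate(TURKIC_LOWER)
    # str.lower() itself is locale-independent Unicode lowering.
    return text.lower()
```

Under these assumptions, c_locale_lower leaves 'ÅÄÖ' untouched, lower_for_language('ÅÄÖ', 'sv') gives 'åäö', and 'TITLE' lowers to 'title' for Swedish but to 'tıtle' (with dotless ı) for Turkish. A mapping that does not know the language cannot get both right.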

>
> I'd be delighted to get more feedback about this stuff.
>
> Thanks,
> Palle
>
>
>
