Re: Patch for collation using ICU

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Palle Girgensohn <girgen(at)pingpong(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Patch for collation using ICU
Date: 2005-05-07 13:52:59
Message-ID: 200505071352.j47DqxK28575@candle.pha.pa.us
Lists: pgsql-hackers

Palle Girgensohn wrote:
> >> Also, apparently, ICU is installed by default in many linux
> >> distributions, and usually it is version 2.8. Some linux users have
> >> asked me if there are plans for a patch that works with ICU 2.8. That's
> >> probably a good idea. IBM and the ICU folks seem to consider 3.2 to be
> >> the stable version, older versions are hard to find on their sites, but
> >> most linux distributors seem to consider it too bleeding edge, even
> >> gentoo. I don't know why they don't agree.
> >
> > Good point. Why would linux folks need ICU? Doesn't their OS support
> > encodings natively? I am particularly excited about this for OSs that
> > don't have such encodings, like UTF8 support for Win32.
> >
> > Because ICU will not be used unless enabled by configure, it seems we
> > are fine with only supporting the newest version. Do Linux users need
> > to use ICU for any reason?
>
>
> There are corner cases where it is impossible to upper/lowercase one
> character at a time. For example:
>
> -- without ICU
> select upper('Eßer');
> upper
> -------
> EßER
> (1 row)
>
> -- with ICU
> select upper('Eßer');
> upper
> -------
> ESSER
> (1 row)
>
> This is because in the standard postgres implementation, upper/lower is
> done one character at a time. A proper upper/lower cannot do it that way.
> Another known example is in Turkish, where an ? (?) should look different
> depending on whether it is an initial letter or not. This fails in
> standard postgresql on all platforms.

Uh, where do you see that? Our code has:

    workspace = texttowcs(string);

    for (i = 0; workspace[i] != 0; i++)
        workspace[i] = towupper(workspace[i]);

    result = wcstotext(workspace, i);
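
As a minimal sketch of what that loop does (assuming a standalone program; upper_in_place is a hypothetical name, and the texttowcs()/wcstotext() conversions are left out), each wide character is replaced by exactly one wide character. The result therefore always has the same length as the input, so a four-character string cannot come out as a five-character one such as ESSER:

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper, not a PostgreSQL function: uppercase a
 * NUL-terminated wide string in place, one character at a time,
 * and return the number of characters processed.
 */
static size_t
upper_in_place(wchar_t *workspace)
{
    size_t i;

    assert(workspace != NULL);

    /* One-to-one mapping: slot i in, slot i out. The length never changes. */
    for (i = 0; workspace[i] != 0; i++)
        workspace[i] = (wchar_t) towupper((wint_t) workspace[i]);

    return i;
}
```

Whatever towupper() returns for the sharp s in a given locale, it can only fill the one slot that character already occupies.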

> >> Also, in the latest patch, I also added checks and logging for *every*
> >> status returned from ICU. I hope this will help debugging on debian,
> >> where the previous version didn't work. That excessive status checking
> >> will hardly be necessary once the stuff is better tested.
> >>
> >> I think the string copying and heap/palloc choices account for most of
> >> the code bloat, together with the excessive status checking and logging.
> >
> > OK, move that into some common functions and I think it will be better.
>
> Best way for upper/lower/initcap is probably to use a function pointer...
> uhh...

Uh, I don't think so. Just send pointers to the function and let
the function allocate the memory, and another function to free them, or
something like that. I can probably do it if you want.
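
A rough sketch of that idea, with hypothetical names (workspace_alloc, workspace_free) and plain malloc()/free() standing in for palloc()/pfree(). The caller passes in pointers, one function allocates and fills them, and a second function releases the memory:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: allocate a private wide-character copy of src.
 * On success, stores the buffer in *out and its length in *len and
 * returns 0; returns -1 if the allocation fails.
 */
static int
workspace_alloc(const wchar_t *src, wchar_t **out, size_t *len)
{
    size_t   n;
    wchar_t *buf;

    assert(src != NULL && out != NULL && len != NULL);

    n = wcslen(src);
    buf = malloc((n + 1) * sizeof(wchar_t));
    if (buf == NULL)
        return -1;

    memcpy(buf, src, (n + 1) * sizeof(wchar_t));   /* includes the NUL */
    *out = buf;
    *len = n;
    return 0;
}

/* Hypothetical counterpart: free the buffer and clear the caller's pointer. */
static void
workspace_free(wchar_t **buf)
{
    assert(buf != NULL);
    free(*buf);
    *buf = NULL;
}
```

With this shape, upper, lower and initcap would each call the same pair around their own conversion step instead of repeating the allocation logic.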

> >> > Why do you need to add a mapping of encoding names from iana to our
> >> > names?
> >>
> >> This was already answered by John Hansen... There's an old thread here
> >> about the choice of the name "UNICODE" to describe an encoding, which it
> >> doesn't. There's half a dozen unicode based encodings... UTF-8 is used
> >> by postgresql, that would have been a better name... Similarly for most
> >> other encodings, really. ICU expects a setlocale(3) string (i.e. IANA).
> >> PostgreSQL can't provide it, so a mapping table is required.
> >
> > We have deprecated UNICODE in 8.1 in favor of UTF8 (no dash). Does that
> > help?
>
> I'm aware of that. It might help for unicode, but there are a bunch of
> other encodings. IANA has decided that utf-8 has *no* aliases, hence only
> utf-8 (with dash, but case insensitive) is accepted. Perhaps ICU is
> forgiving, I don't remember/know, but I think we need the mappings,
> unfortunately.

OK. I guess I am just confused why the native implementations are OK.
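
For reference, a mapping table of the kind described above could look roughly like this. The struct, the function names, and the handful of entries are illustrative assumptions, not the patch's actual table:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical entry: a PostgreSQL encoding name and an IANA-style name. */
typedef struct
{
    const char *pg_name;
    const char *iana_name;
} encoding_map_entry;

/* Illustrative subset only; a real table would cover every encoding. */
static const encoding_map_entry encoding_map[] = {
    {"UNICODE", "UTF-8"},
    {"UTF8", "UTF-8"},
    {"LATIN1", "ISO-8859-1"},
    {"EUC_JP", "EUC-JP"}
};

/* Case-insensitive comparison of two NUL-terminated ASCII names. */
static int
names_equal_ci(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0')
    {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;    /* equal only if both strings ended together */
}

/* Return the mapped name, or NULL when the encoding is not in the table. */
static const char *
pg_to_iana(const char *pg_name)
{
    size_t i;

    assert(pg_name != NULL);

    for (i = 0; i < sizeof(encoding_map) / sizeof(encoding_map[0]); i++)
    {
        if (names_equal_ci(pg_name, encoding_map[i].pg_name))
            return encoding_map[i].iana_name;
    }
    return NULL;
}
```

Returning NULL for an unknown name leaves it to the caller to decide whether to fall back to the native routines or report an error.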

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
