Rough draft for Unicode-aware UPPER()/LOWER()/INITCAP()

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Rough draft for Unicode-aware UPPER()/LOWER()/INITCAP()
Date: 2004-05-13 02:42:26
Message-ID: 2739.1084416146@sss.pgh.pa.us
Lists: pgsql-hackers

I got tired of reading complaints about how upper/lower don't work with
Unicode, so I went and prototyped a solution. The attached code uses
the C99-standard functions mbstowcs and wcstombs to convert to and from
a "wchar_t[]" representation that can be fed to the also-C99 functions
towupper, towlower, etc.
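
For readers who want to see the shape of that round trip, here is a minimal sketch. It is not the code from the attached patch: the helper name mb_upper and its malloc-based memory handling are assumptions made for illustration only. It converts a multibyte string to wchar_t[] with mbstowcs, upper-cases each element with towupper, and converts back with wcstombs, all under whatever LC_CTYPE is currently in effect.

```c
#include <assert.h>   /* assert */
#include <stdlib.h>   /* mbstowcs, wcstombs, malloc, free, MB_CUR_MAX */
#include <string.h>   /* strlen */
#include <wchar.h>    /* wchar_t, wint_t */
#include <wctype.h>   /* towupper */

/*
 * Hypothetical helper (not taken from the patch).
 * Returns a malloc'd upper-cased copy of src, or NULL if src is not valid
 * in the current LC_CTYPE encoding or memory runs out.
 */
char *
mb_upper(const char *src)
{
    size_t   srclen;
    size_t   nwide;
    size_t   outsize;
    size_t   i;
    wchar_t *wide;
    char    *out;

    assert(src != NULL);
    srclen = strlen(src);

    /* A string never has more wide characters than it has bytes. */
    wide = malloc((srclen + 1) * sizeof(wchar_t));
    if (wide == NULL)
        return NULL;

    /* Multibyte -> wide; (size_t) -1 signals an invalid byte sequence. */
    nwide = mbstowcs(wide, src, srclen + 1);
    if (nwide == (size_t) -1)
    {
        free(wide);
        return NULL;
    }

    /* Case-map one wide character at a time. */
    for (i = 0; i < nwide; i++)
        wide[i] = (wchar_t) towupper((wint_t) wide[i]);

    /* Each wide character needs at most MB_CUR_MAX bytes, plus the NUL. */
    outsize = nwide * MB_CUR_MAX + 1;
    out = malloc(outsize);
    if (out == NULL || wcstombs(out, wide, outsize) == (size_t) -1)
    {
        free(out);
        free(wide);
        return NULL;
    }

    free(wide);
    return out;
}
```

A caller would normally run setlocale(LC_CTYPE, "") once at startup so that the conversion functions follow the environment's locale; without that call the program stays in the "C" locale and only ASCII letters are case-mapped.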

This code will only work if the database is running under an LC_CTYPE
setting that implies the same encoding specified by server_encoding.
However, I don't see that as a fatal objection, because in point of fact
the existing upper/lower code assumes the same thing. When LC_CTYPE and
server_encoding don't match, this code may deliver an "invalid multibyte
character" error rather than silently producing a wrong answer, but is
that really a step backward?
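
The error case comes straight from the C library: mbstowcs returns (size_t) -1 when it meets a byte sequence that is not valid in the encoding implied by LC_CTYPE. The sketch below shows one way to detect that. The helper name mb_string_is_valid is hypothetical, and how the patch itself reports the error is not shown here.

```c
#include <assert.h>   /* assert */
#include <stdlib.h>   /* mbstowcs, malloc, free */
#include <string.h>   /* strlen */
#include <wchar.h>    /* wchar_t */

/*
 * Hypothetical helper: returns 1 if s decodes cleanly under the current
 * LC_CTYPE, 0 if mbstowcs rejects it. Running out of memory is also
 * reported as 0 in this sketch.
 */
int
mb_string_is_valid(const char *s)
{
    size_t   len;
    size_t   n;
    wchar_t *wide;

    assert(s != NULL);
    len = strlen(s);

    wide = malloc((len + 1) * sizeof(wchar_t));
    if (wide == NULL)
        return 0;

    /* (size_t) -1 means "invalid multibyte character". */
    n = mbstowcs(wide, s, len + 1);
    free(wide);

    return n != (size_t) -1;
}
```

For example, under a UTF-8 LC_CTYPE a Latin-1 string containing the byte 0xFF should be rejected, since 0xFF never appears in valid UTF-8. What happens to high-bit bytes under the plain "C" locale varies between C libraries.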

Note this patch is *not* meant for application to CVS yet. It's not
autoconfiscated. But if you have a platform that has mbstowcs and
friends, please try it and let me know about any portability gotchas
you see.

Also, as a character-set-impaired American, I'm probably not the best
qualified person to judge whether the patch actually does what's wanted.
It seemed to do the right sorts of conversions in my limited testing,
but does it do what *you* want it to do?

regards, tom lane

PS: the patch works against either 7.4 or CVS tip.

Attachment: unknown_filename (text/plain, 5.9 KB)