Windows and locales and UTF-8 (oh my)

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	pgsql-hackers(at)postgreSQL(dot)org
Subject:	Windows and locales and UTF-8 (oh my)
Date:	2007-10-06 17:53:31
Message-ID:	26692.1191693211@sss.pgh.pa.us
Lists:	pgsql-hackers

I've been learning much more than I wanted to know about $SUBJECT
since putting in the src/port/chklocale.c code to try to enforce
that our database encoding matches the system locale settings.
There's an ongoing thread in -patches that's been focused on
getting reasonable behavior from the point of view of the Far
Eastern contingent:
http://archives.postgresql.org/pgsql-patches/2007-10/msg00031.php
(Some of that's been applied, but not the very latest proposals.)
Here's some more info from an off-list discussion with Dave Page:

------- Forwarded Messages

Date: Fri, 05 Oct 2007 20:54:04 +0100
From: Dave Page <dpage(at)postgresql(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [CORE] 8.3beta1 Available ...

Dave Page wrote:
> Some further info on that - utf-8 on Windows is actually a
> pseudo-codepage (65001) which doesn't have NLS files, hence why we have
> to convert to utf-16 before sorting. Perhaps the utf-8/65001 name
> difference is the problem here. I'll knock up a quick test program when
> the kids have gone to bed.

So, my test prog (below) returns the following:

Dave(at)SNAKE:~$ ./setlc "English_United Kingdom.65001"
LC_COLLATE=English_United Kingdom.65001;LC_CTYPE=C;LC_MONETARY=English_United Kingdom.65001;LC_NUMERIC=English_United Kingdom.65001;LC_TIME=English_United Kingdom.65001

So everything other than LC_CTYPE is acceptable in UTF-8 on Windows -
and we already handle LC_CTYPE for UTF-8 on Windows through our UTF-8 ->
UTF-16 conversions internally.

Can we change initdb to test against LC_TIME instead of LC_CTYPE perhaps?

Regards, Dave.

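#include <locale.h>
#include <stdio.h>

int
main(int argc, char *argv[])
{
    char *lc;

    /* Apply the locale named on the command line, if one was given. */
    if (argc > 1)
        setlocale(LC_ALL, argv[1]);

    /* Ask the C runtime which per-category settings are now in effect. */
    lc = setlocale(LC_ALL, NULL);
    printf("%s\n", lc);
    return 0;
}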

------- Message 2

Date: Fri, 05 Oct 2007 23:32:36 +0100
From: Dave Page <dpage(at)postgresql(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [CORE] 8.3beta1 Available ...

Tom Lane wrote:
> Dave Page <dpage(at)postgresql(dot)org> writes:
>> So, my test prog (below) returns the following:
>
>> Dave(at)SNAKE:~$ ./setlc "English_United Kingdom.65001"
>> LC_COLLATE=English_United Kingdom.65001;LC_CTYPE=C;LC_MONETARY=English_United Kingdom.65001;LC_NUMERIC=English_United Kingdom.65001;LC_TIME=English_United Kingdom.65001
>
> That's just frickin' weird ... and a bit scary. There is a fair amount
> of code in PG that checks for lc_ctype_is_c and does things differently;
> one wonders if that isn't going to get misled by this behavior. (Hmm,
> maybe this explains some of the "upper/lower doesn't work" reports we've
> been getting??) Are you sure all variants of Windows act that way?

All the ones we support afaict.

>> Can we change initdb to test against LC_TIME instead of LC_CTYPE perhaps?
>
> Is there something in Windows that constrains them to be all the same?
> If not this proposal seems just plain wrong :-( But in any case I'd
> feel more comfortable having it look at LC_COLLATE.

They can all be set independently - it's just that there are no UTF-7
(65000) or UTF-8 (65001) NLS files (http://shlimazl.nm.ru/eng/nls.htm)
defining them fully, so Windows doesn't know any more than the characters
that are in both 'pseudo codepages'.

As a result, you can't set LC_CTYPE to .65001 because Windows knows it
can't handle ToUpper() or ToLower() etc. but you can use it to encode
messages and other text.

------- End of Forwarded Messages

I am thinking that Dave's discovery explains some previously unsolved
bug reports, such as
http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
If Windows returns LC_CTYPE=C in a situation like this, then
the various single-byte-charset optimization paths that are enabled by
lc_ctype_is_c() would be mistakenly used, leading to misbehavior in
upper()/lower() and other places. ISTM we had better hack
lc_ctype_is_c() so that on Windows (only), if the database encoding
is UTF-8 then it returns FALSE regardless of what setlocale says.
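
To make the proposed override concrete, here is a minimal sketch. It is not
the actual PostgreSQL code: the function name and both boolean parameters
are hypothetical stand-ins for state the real lc_ctype_is_c() would obtain
from the backend and from the build platform.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical stand-in for lc_ctype_is_c().
 *
 * ctype_name           what setlocale(LC_CTYPE, NULL) reported
 * db_encoding_is_utf8  assumed flag: the database encoding is UTF-8
 * on_windows           assumed flag: true when running on Windows
 */
static bool
lc_ctype_is_c_sketch(const char *ctype_name,
                     bool db_encoding_is_utf8,
                     bool on_windows)
{
    assert(ctype_name != NULL);

    /*
     * Windows reports LC_CTYPE=C for a ".65001" locale, so the reported
     * name cannot be trusted when the database encoding is UTF-8.
     */
    if (on_windows && db_encoding_is_utf8)
        return false;

    return strcmp(ctype_name, "C") == 0 ||
           strcmp(ctype_name, "POSIX") == 0;
}
```

With this shape, a UTF-8 database on Windows never takes the
single-byte-charset shortcuts, while every other combination keeps the
existing behavior.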

That still leaves me with a boatload of questions, though. If we can't
trust LC_CTYPE as an indicator of the system charset, what can we trust?
In particular this seems to say that looking at LC_CTYPE for chklocale's
purposes is completely useless; what do we look at instead?

Another issue: is it possible to set, say, LC_MESSAGES and LC_TIME to
different codepages and if so what happens? If that does enable
different bits of infrastructure to return incompatibly encoded strings,
seems we need a defense against that --- what should it be?
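
One conceivable check, offered purely as an illustration and not as a
settled answer to the question above, is to compare the codepage suffixes
of the locale names chosen for two categories. Both helper names below are
hypothetical, and the sketch assumes the Windows "Language_Country.codepage"
naming seen in Dave's output.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Return the text after the last '.', or NULL if there is no codepage part. */
static const char *
locale_codepage(const char *name)
{
    const char *dot;

    assert(name != NULL);
    dot = strrchr(name, '.');
    return dot ? dot + 1 : NULL;
}

/* True when both names carry the same codepage suffix, or both carry none. */
static bool
locales_share_codepage(const char *a, const char *b)
{
    const char *cpa = locale_codepage(a);
    const char *cpb = locale_codepage(b);

    if (cpa == NULL || cpb == NULL)
        return cpa == cpb;
    return strcmp(cpa, cpb) == 0;
}
```

A name without a suffix (such as "C") is treated here as a mismatch against
any name that has one; whether that is the right policy is exactly the open
question.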

One bright spot is that this does seem to suggest a way to implement the
recommendation I made in the -patches thread: if we can't support the
encoding (codepage) used by the locale seen by initdb, we could try
stripping the codepage indicator (if any) and plastering on .65001
to get a UTF8-compatible locale name. That'd only work on Windows
but that seems the platform where we're most likely to see unsupportable
default encodings.
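
The string manipulation involved might look like the sketch below. The
helper name is hypothetical, and it assumes that the codepage indicator, if
present, is whatever follows the last '.' in the locale name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy "locale" into "buf" with any trailing
 * ".codepage" replaced by ".65001".  Returns false if the result does
 * not fit in buflen bytes.
 */
static bool
make_utf8_locale_name(const char *locale, char *buf, size_t buflen)
{
    const char *dot;
    size_t      baselen;
    int         n;

    assert(locale != NULL && buf != NULL);

    /* The codepage indicator, if any, follows the last '.' */
    dot = strrchr(locale, '.');
    baselen = dot ? (size_t) (dot - locale) : strlen(locale);

    n = snprintf(buf, buflen, "%.*s.65001", (int) baselen, locale);
    return n >= 0 && (size_t) n < buflen;
}
```

The rewritten name would still have to be passed to setlocale() to find out
whether Windows accepts it; the sketch only builds the candidate string.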

Comments? I don't have a Windows development environment so I'm not
in a position to take the lead on testing/fixing this sort of stuff.

regards, tom lane

Responses

Locales and Encodings at 2007-10-12 11:24:34 from Gregory Stark
Re: Windows and locales and UTF-8 (oh my) at 2007-10-15 09:09:54 from Magnus Hagander
