Re: Windows and locales and UTF-8 (oh my)

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Windows and locales and UTF-8 (oh my)
Date: 2007-10-15 11:40:10
Message-ID: 20071015114010.GD5806@svr2.hagander.net
Lists: pgsql-hackers

On Mon, Oct 15, 2007 at 01:26:00PM +0200, Magnus Hagander wrote:
> On Mon, Oct 15, 2007 at 11:09:54AM +0200, Magnus Hagander wrote:
> > On Sat, Oct 06, 2007 at 01:53:31PM -0400, Tom Lane wrote:
> > > I am thinking that Dave's discovery explains some previously unsolved
> > > bug reports, such as
> > > http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
> > > If Windows returns LC_CTYPE=C in a situation like this, then
> > > the various single-byte-charset optimization paths that are enabled by
> > > lc_ctype_is_c() would be mistakenly used, leading to misbehavior in
> > > upper()/lower() and other places. ISTM we had better hack
> > > lc_ctype_is_c() so that on Windows (only), if the database encoding
> > > is UTF-8 then it returns FALSE regardless of what setlocale says.
> >
> > Yes, I think we need a change to that routine.
> >
> > But what about the case when we actually *have* locale=C and
> > encoding=UTF8? We need to handle that one somehow. Perhaps we should look
> > at LC_COLLATE instead (again, on Windows only; possibly even only in the
> > windows+locale_returns_c+encoding=utf8 case, to distinguish these two)?
>
> Hmm. Looking more at that, might there be another problem? Looking at
> WriteControlFile(), it writes out what setlocale(LC_CTYPE) returns, which
> will then be "C" - even if the database isn't in C.
>
> But I don't really know when that code is called, or if I'm just looking at
> things wrong. Just starting up and shutting down the database leaves it at
> Swedish_Sweden.1252, not C.
> (1252 is still the wrong encoding specifier, but it'll work anyway since we
> convert to UTF16)

Gah, got that backwards. Of course it does, because setlocale only returns
"C" if we set it to Swedish_Sweden.65001, and we don't *do* that with the
patch I sent in earlier. We set it to Swedish_Sweden, which is a perfectly
valid LC_CTYPE.

And given that, do we even need to special-case lc_ctype_is_c() at all, if
we never pass in a .65001 locale (which we don't, because it fails)?
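
For reference, the special case being questioned could look roughly like the
sketch below. This is only an illustration under stated assumptions, not the
actual PostgreSQL code: the function names, the boolean parameters, and the
idea of passing the locale name in as a string are all hypothetical stand-ins
for whatever lc_ctype_is_c() really consults.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: true if the name reported for LC_CTYPE denotes the
 * plain C locale. Treating "POSIX" as equivalent is an assumption here.
 */
static bool
locale_name_is_c(const char *locale_name)
{
    assert(locale_name != NULL);
    return strcmp(locale_name, "C") == 0 ||
           strcmp(locale_name, "POSIX") == 0;
}

/*
 * Sketch of the proposed hack: on Windows with a UTF-8 database, never
 * report "C", whatever name was reported for the locale. Everywhere else,
 * fall back to the ordinary name check.
 */
static bool
use_c_ctype_fast_path(const char *locale_name, bool on_windows,
                      bool db_is_utf8)
{
    if (on_windows && db_is_utf8)
        return false;   /* do not trust a "C" answer in this combination */
    return locale_name_is_c(locale_name);
}
```

Note that a rule like this also disables the fast path for a database that
genuinely has locale=C and encoding=UTF8 on Windows, which is the case raised
earlier in the thread. If the locale is always set without the .65001 suffix,
the first branch would never be needed.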

//Magnus
