Re: [GENERAL] trouble with to_char('L')

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] trouble with to_char('L')
Date: 2010-04-16 10:52:27
Message-ID: g2o9837222c1004160352n319ac670p647bef30d121e50c@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

On Mon, Mar 22, 2010 at 9:14 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> Takahiro Itagaki wrote:
>>
>> Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>>
>> > Takahiro Itagaki wrote:
>> > > Since 9.0 has GetPlatformEncoding() for the purpose, we could simplify
>> > > db_encoding_strdup() with the function. Like this:
>> >
>> > OK, I don't have any Win32 people testing this patch so if we want this
>> > fixed for 9.0 someone is going to have to test my patch to see that it
>> > works.  Can you make the adjustments suggested above to my patch and
>> > test it to see that it works so we can apply it for 9.0?
>>
>> Here is a full patch that can be applied cleanly to HEAD.
>> Can anyone test it on Windows?
>>
>> I'm not sure why temporary changes of lc_ctype was required in the
>> original patch. The codes are not included in my patch, but please
>> notice me it is still needed.
>
> Sorry for the delay in replying to you.
>
> I considered your idea of using the existing Postgres encoding
> conversion routines to do the conversion of localenv() strings, but
> found two problems.
>
> First, GetPlatformEncoding() caches its result, so it assumes the
> LC_CTYPE never changes for the server, while fixing this issue actually
> requires us to change LC_CTYPE.  We could avoid the caching but that
> then involves complex table lookups, etc, which seems overly complex:
>
> +       /* convert the string to the database encoding */
> +       pstr = (char *) pg_do_encoding_conversion(
> +                                               (unsigned char *) str, strlen(str),
> +                                               GetPlatformEncoding(), GetDatabaseEncoding());
>
> Second, having our backend routines do the conversion seems wrong
> because it is possible for someone to set LC_MONETARY to an encoding
> that our database does not understand, e.g. UTF16, but one that WIN32
> can convert to a valid encoding.
>
> The reason we are doing all this is because of this updated comment in
> my patch:
>
>        ftp://momjian.us/pub/postgresql/mypatches/pg_locale
>
> +    *  Ideally, monetary and numeric local symbols could be returned in
> +    *  any server encoding.  Unfortunately, the WIN32 API does not allow
> +    *  setlocale() to return values in a codepage/CTYPE that uses more
> +    *  than two bytes per character, like UTF-8:
> +    *
> +    *      http://msdn.microsoft.com/en-us/library/x99tb11d.aspx
> +    *
> +    *  Evidently, LC_CTYPE allows us to control the encoding used
> +    *  for strings returned by localeconv().  The Open Group
> +    *  standard, mentioned at the top of this C file, doesn't
> +    *  explicitly state this.
> +    *
> +    *  Therefore, we set LC_CTYPE to match LC_NUMERIC and
> +    *  LC_MONETARY, call localeconv(), and use mbstowcs() to
> +    *  convert the locale-aware string, e.g. Euro symbol (which
> +    *  is not in UTF-8), to the server encoding.
>
> One new idea would be to set LC_CTYPE to UTF16/widechars unconditionally
> on Win32 and then just convert that always to the server encoding with
> win32_wchar_to_db_encoding(), instead of using the encoding from
> LC_MONETARY to set LC_CTYPE and having to do double-conversion.

So, hugely late, reviving this thread.

Ideally, we should definitely consider doing that. Internally, Windows
will do it in UTF16 anyway. So we're basically doing
UTF16->db->UTF16->UTF8->db or something like that with this patch.

But I'm unsure how that would work. We're talking about the output of
localeconv(), right? I don't see a version of localeconv() that does
wide chars anywhere. (You can't just set LC_CTYPE and use the regular
function - Windows has a separate set of functions for dealing with
UTF16).

Looking at the patch, you're passing "item" to db_encoding_strdup()
but it doesn't seem to be used anywhere. Leftover from previous
experiments, or forgot to use it? Perhaps you intended for it to be in
the error messages?

Also, won't this need special-casing for UTF8? Per comment in
mbutils.c, wcstombs() doesn't work for UTF8 encodings - you need to
use MultiByteToWideChar().

I also note that we have char2wchar() already - we should perhaps just
call that? Or will that use the wrong locale?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message dipti shah 2010-04-16 11:02:16 Re: How to get whether user has ALL permissions on table?
Previous Message Szymon Guz 2010-04-16 10:40:21 Re: Tuple storage overhead

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2010-04-16 11:47:42 Re: testing HS/SR - 1 vs 2 performance
Previous Message Heikki Linnakangas 2010-04-16 09:54:17 Re: Unsafe threading in syslogger on Windows