Re: TM format can mix encodings in to_char()

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Juan José Santamaría Flecha <juanjo(dot)santamaria(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: TM format can mix encodings in to_char()
Date: 2019-04-19 16:47:39
Message-ID: 15600.1555692459@sss.pgh.pa.us
Lists: pgsql-hackers

Juan José Santamaría Flecha <juanjo(dot)santamaria(at)gmail(dot)com> writes:
> The problem is that the locale 'tr_TR' uses the encoding ISO-8859-9 (LATIN5),
> while the test runs in UTF8. So the following code will raise an error:

> SET lc_time TO 'tr_TR';
> SELECT to_char(date '2010-02-01', 'DD TMMON YYYY');
> ERROR: invalid byte sequence for encoding "UTF8": 0xde 0x75

Ugh.

> The problem seems to be in the code touched in the attached patch.

Hmm. I'd always imagined that the way that libc works is that LC_CTYPE
determines the encoding (codeset) it's using across the board, so that
functions like strftime would deliver data in that encoding. That's
mainly based on the observation that nl_langinfo(CODESET) is specified
to depend on LC_CTYPE, and it would be monumentally stupid for any libc
functions to be operating according to a codeset that there's no way to
discover.
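A minimal sketch of that model, for illustration only: the helper below reports the codeset that a given LC_CTYPE setting implies, using nl_langinfo(CODESET). The name codeset_for_ctype and its buffer-based interface are hypothetical, not anything in PostgreSQL or libc, and the sketch assumes a POSIX platform that provides <langinfo.h>.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not PostgreSQL code): copy into "out" the codeset
 * name that nl_langinfo(CODESET) reports while LC_CTYPE is set to
 * "ctype_locale", then restore the previous LC_CTYPE setting.
 *
 * Returns 0 on success, -1 if the locale cannot be set or "out" is too small.
 */
static int
codeset_for_ctype(const char *ctype_locale, char *out, size_t outlen)
{
    char        saved[256];
    const char *cur;
    int         n;

    assert(ctype_locale != NULL && out != NULL);

    /* setlocale's result may be overwritten by later calls, so copy it */
    cur = setlocale(LC_CTYPE, NULL);
    if (cur == NULL || strlen(cur) >= sizeof(saved))
        return -1;
    strcpy(saved, cur);

    if (setlocale(LC_CTYPE, ctype_locale) == NULL)
        return -1;

    /* CODESET is specified to depend on LC_CTYPE, not on LC_TIME etc. */
    n = snprintf(out, outlen, "%s", nl_langinfo(CODESET));

    /* put the caller's LC_CTYPE back */
    setlocale(LC_CTYPE, saved);

    return (n < 0 || (size_t) n >= outlen) ? -1 : 0;
}
```

There is no corresponding call that takes LC_TIME as its input, which is why a caller has no portable way to learn what encoding strftime's output is in when LC_TIME and LC_CTYPE disagree.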

However, your example shows that at least glibc is indeed
monumentally stupid about this :-(.

But ... perhaps other implementations are not so silly? I went
looking into the POSIX spec to see if it says anything about this,
and discovered (in Base Definitions section 7, Locale):

    If different character sets are used by the locale categories, the
    results achieved by an application utilizing these categories are
    undefined. Likewise, if different codesets are used for the data being
    processed by interfaces whose behavior is dependent on the current
    locale, or the codeset is different from the codeset assumed when the
    locale was created, the result is also undefined.

"Undefined" is a term of art here: it means the library can misbehave
arbitrarily badly, up to and including abort() or halt-and-catch-fire.
We do *not* want to be invoking undefined behavior, even if particular
implementations seem to behave sanely. Your proposed patch isn't
getting us out of that, and what it is doing instead is embedding an
assumption that the implementation handles this in a particular way.

So what I'm thinking really needs to be done here is to force it to work
according to the LC_CTYPE-determines-the-codeset-for-everything model.
Note that that model is embedded into PG in quite a few ways besides the
one at stake here; for instance, pg_perm_setlocale thinks it should make
gettext track the LC_CTYPE encoding, not anything else.

If we're willing to assume a lot about how locale names are spelled,
we could imagine fixing this in cache_locale_time by having it strip
any encoding spec from the given LC_TIME string and then adding on the
codeset name from nl_langinfo(CODESET). Not sure about how well
that'd play on Windows, though. We'd also need to adjust check_locale
so that it does the same dance.
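The name rewriting described above could look roughly like the following. This is only a sketch, not a proposed patch: the function name replace_locale_codeset is hypothetical, and it assumes locale names follow the XPG-style spelling language[_territory][.codeset][@modifier], which is exactly the "assume a lot about how locale names are spelled" caveat. It does not address Windows locale names at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not PostgreSQL code): write into "out" a copy of
 * "locale" whose codeset part, if any, is replaced by "codeset".
 * Assumes names of the form language[_territory][.codeset][@modifier].
 *
 * Returns 0 on success, -1 if the result does not fit in "out".
 */
static int
replace_locale_codeset(const char *locale, const char *codeset,
                       char *out, size_t outlen)
{
    const char *dot;
    const char *at;
    size_t      baselen;
    int         n;

    assert(locale != NULL && codeset != NULL && out != NULL);

    /* "C" and "POSIX" take no codeset suffix; pass them through unchanged */
    if (strcmp(locale, "C") == 0 || strcmp(locale, "POSIX") == 0)
        n = snprintf(out, outlen, "%s", locale);
    else
    {
        at = strchr(locale, '@');   /* start of optional modifier */
        dot = strchr(locale, '.');  /* start of optional codeset */
        if (dot != NULL && at != NULL && dot > at)
            dot = NULL;             /* a dot inside the modifier is not a codeset */

        /* the base name ends at the codeset, else at the modifier, else at NUL */
        if (dot != NULL)
            baselen = (size_t) (dot - locale);
        else if (at != NULL)
            baselen = (size_t) (at - locale);
        else
            baselen = strlen(locale);

        /* base name + "." + new codeset + original modifier, if any */
        n = snprintf(out, outlen, "%.*s.%s%s",
                     (int) baselen, locale, codeset, at ? at : "");
    }

    return (n < 0 || (size_t) n >= outlen) ? -1 : 0;
}
```

With the codeset argument taken from nl_langinfo(CODESET), an LC_TIME value of "tr_TR" or "tr_TR.ISO-8859-9" would both become "tr_TR.UTF-8" in a UTF8 database. Whether setlocale() accepts the rewritten name is platform-dependent, so a caller would still have to handle failure.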

BTW, it seems very likely that we have similar issues with LC_MONETARY
and LC_NUMERIC in PGLC_localeconv(). There's an interesting Windows-only
hack in there now that seems to be addressing more or less the same issue;
I wonder whether that would be rendered unnecessary if we approached it
like this?

I'm also wondering why we have not noticed any comparable problem with
LC_MESSAGES or LC_COLLATE. It's not so surprising that we haven't
understood this hazard before with LC_TIME/LC_MONETARY/LC_NUMERIC given
their limited usage in PG, but the same can't be said of LC_MESSAGES or
LC_COLLATE.

regards, tom lane
