Re: character encoding in StartupMessage

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, John DeSoi <desoi(at)pgedit(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: character encoding in StartupMessage
Date: 2006-02-28 16:45:27
Message-ID: 20060228164527.GF535@svana.org
Lists: pgsql-hackers

On Tue, Feb 28, 2006 at 11:19:02AM -0500, Tom Lane wrote:
> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> >>> This may be the only solution. Converting everything to UTF-8 has
> >>> issues because some encodings are not roundtrip-safe
>
> >> Is this still true?
>
> > I believe so. If you use the ICU Converter Explorer [1] to examine some of
> > the encodings we support, they have "Contains ambiguous aliases? TRUE".
>
> Which ones, and are they client-only encodings? If all our server-side
> encodings are round-trip safe then I think there's no big issue.
>
> In any case I don't think there's a huge problem if we say that database
> and user names had better be chosen from the round-trip-safe subset.

This is what it says here [1]:

There are only 19 encodings currently used worldwide as legitimate
POSIX multi-byte locale encodings:

UTF-8, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-5, ISO-8859-6,
ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-13, ISO-8859-15,
EUC-JP, EUC-KR, GB2312 (= EUC-CN), KOI8-R, KOI8-U, VISCII,
WINDOWS-1251, WINDOWS-1256

Each of these is fully roundtrip compatible to ISO 10646, therefore
all these locales can be represented nicely in wchar_t as the
equivalent UCS values. The above names and the corresponding defining
documents are listed in the IANA charset registry.

Some of these have multiple definitions according to ICU, meaning that
different platforms have implemented them differently in the past
(EUC-JP falls into this category), but presumably the IANA charset
registry has proper definitions.

Of the remaining encodings we support, Big5 is OK, although win-950,
which is the Windows version, has changed over time. GBK has the same
problem: win-936 has also changed over time. I don't think we should
concern ourselves with bugs in the Windows encodings.

IOW, I think we are mostly safe.
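
A rough way to spot-check round-trip safety for the single-byte encodings
is sketched below in Python. It assumes Python's bundled codecs as a
stand-in for the IANA definitions, which may not match what the server
actually implements, and the helper name roundtrip_failures is made up
for illustration. Multi-byte encodings such as EUC-JP are not covered,
since their lead bytes do not decode on their own.

```python
from typing import List


def roundtrip_failures(encoding: str) -> List[int]:
    """Return byte values that decode to Unicode but do not encode back
    to the same byte. Only meaningful for single-byte encodings."""
    failures = []
    for value in range(256):
        raw = bytes([value])
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            # Byte is unassigned (or an incomplete sequence) in this
            # encoding, so there is nothing to round-trip.
            continue
        try:
            back = text.encode(encoding)
        except UnicodeEncodeError:
            failures.append(value)
            continue
        if back != raw:
            failures.append(value)
    return failures


if __name__ == "__main__":
    # Codec names are Python's spellings, not the PostgreSQL ones.
    for name in ("iso8859-1", "iso8859-2", "koi8_r", "cp1251"):
        print(name, roundtrip_failures(name))
```

An empty list means every assigned byte survives the trip through
Unicode and back under that codec; it says nothing about how other
platforms define the same encoding name.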

[1] http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
