Re: character encoding in StartupMessage

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>,John DeSoi <desoi(at)pgedit(dot)com>,PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: character encoding in StartupMessage
Date: 2006-02-28 16:45:27
Message-ID: 20060228164527.GF535@svana.org
Lists: pgsql-hackers
On Tue, Feb 28, 2006 at 11:19:02AM -0500, Tom Lane wrote:
> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> >>> This may be the only solution. Converting everything to UTF-8 has
> >>> issues because some encodings are not roundtrip-safe
> 
> >> Is this still true?
> 
> > I believe so. If you use the ICU Converter Explorer [1] to examine some of
> > the encodings we support, they have "Contains ambiguous aliases? TRUE".
> 
> Which ones, and are they client-only encodings?  If all our server-side
> encodings are round-trip safe then I think there's no big issue.
> 
> In any case I don't think there's a huge problem if we say that database
> and user names had better be chosen from the round-trip-safe subset.

This is what it says here [1]:

  There are only 19 encodings currently used worldwide as legitimate
  POSIX multi-byte locale encodings:

    UTF-8, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-5, ISO-8859-6,
    ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-13, ISO-8859-15,
    EUC-JP, EUC-KR, GB2312 (= EUC-CN), KOI8-R, KOI8-U, VISCII,
    WINDOWS-1251, WINDOWS-1256

  Each of these is fully roundtrip compatible to ISO 10646, therefore
  all these locales can be represented nicely in wchar_t as the
  equivalent UCS values. The above names and the corresponding defining
  documents are listed in the IANA charset registry.

Some of these have multiple definitions according to ICU, meaning that
different platforms have implemented them differently in the past
(EUC-JP falls into this category), but presumably the IANA charset
registry has proper definitions.

Of the remaining encodings we support, Big5 is OK, although win-950,
the Windows version of it, has changed over time. GBK has the same
problem: win-936 has changed over time too. I don't think we should
concern ourselves with bugs in the Windows encodings.
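
For anyone who wants to check round-trip safety directly, here is a
rough sketch in Python. It is only an illustration, not anything from
the PostgreSQL tree: the codec names are Python's, its conversion
tables may well differ from ICU's or from our own, and the helper names
(single_bytes, double_bytes, roundtrip_failures) are made up for this
example. It decodes candidate byte sequences to Unicode, encodes them
back, and collects the ones that do not come back byte-for-byte.

```python
"""Probe whether a legacy encoding round-trips through Unicode."""


def single_bytes():
    # Every possible one-byte sequence.
    return [bytes([b]) for b in range(256)]


def double_bytes():
    # Two-byte sequences with a high lead byte, the range that
    # EUC-style and Big5/GBK-style multi-byte encodings draw from.
    return [bytes([hi, lo])
            for hi in range(0x80, 0x100)
            for lo in range(0x40, 0x100)]


def roundtrip_failures(codec, candidates):
    """Return candidates that decode under `codec` but do not
    re-encode to the identical bytes."""
    failures = []
    for raw in candidates:
        try:
            text = raw.decode(codec)
        except UnicodeDecodeError:
            continue  # not a valid sequence in this encoding
        try:
            back = text.encode(codec)
        except UnicodeEncodeError:
            back = None  # decoded, but cannot be encoded again
        if back != raw:
            failures.append(raw)
    return failures


if __name__ == "__main__":
    # Codec names are Python's spellings; this list is an assumption
    # chosen to mirror the encodings discussed above.
    for codec in ("iso8859_15", "koi8_r", "euc_jp", "big5", "cp950", "gbk"):
        bad = roundtrip_failures(codec, single_bytes() + double_bytes())
        print(codec, len(bad))
```

This only covers the bytes -> Unicode -> bytes direction and only one-
and two-byte sequences, so a clean result here is suggestive rather
than proof. Any sequence it does report would fall outside the
round-trip-safe subset Tom mentions.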

IOW, I think we are mostly safe.

[1] http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html
-- 
Martijn van Oosterhout   <kleptog(at)svana(dot)org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
