Re: invalidly encoded strings

From: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
To: andrew(at)dunslane(dot)net
Cc: ishii(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us, laurenz(dot)albe(at)wien(dot)gv(dot)at, pgsql-hackers(at)postgresql(dot)org
Subject: Re: invalidly encoded strings
Date: 2007-09-11 01:14:40
Message-ID: 20070911.101440.102552554.t-ishii@sraoss.co.jp
Lists: pgsql-hackers pgsql-patches

> Tatsuo Ishii wrote:
> >
> > I don't understand the whole discussion.
> >
> > Why do you think that employing the Unicode code point as the chr()
> > argument could avoid endianness issues? Are you going to represent
> > Unicode code point as UCS-4? Then you have to specify the endianness
> > anyway. (see the UCS-4 standard for more details)
> >
>
> The code point is simply a number. The result of chr() will be a text
> value one char (not one byte) wide, in the relevant database encoding.
>
> U+nnnn maps to the same Unicode char and hence the same UTF8 encoding
> pattern regardless of endianness. e.g. U+00a9 is the copyright symbol on
> all machines. So to get this char in a UTF8 database you could call
> "select chr(169)" and get back the byte pattern \xC2A9.

If you regard the Unicode code point as simply a number, why not
regard the multibyte character as a number too? I mean, since 0xC2A9
= 49833, "select chr(49833)" should work fine, no?
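
To make the two readings concrete, here is a rough sketch in Python
(not PostgreSQL). Python's built-in chr() always takes a Unicode code
point and only stands in for the proposed behavior; the variable names
are made up for this illustration.

```python
# Two ways to turn the copyright sign into an integer in a UTF-8 database.
code_point = 0xA9  # 169, the Unicode code point U+00A9

# Reading 1: the integer is a Unicode code point.
utf8_bytes = chr(code_point).encode("utf-8")

# Reading 2: the integer is the encoded byte sequence read as one
# big-endian number.
encoded_value = int.from_bytes(utf8_bytes, "big")

# Going back from the encoded value to the character.
n_bytes = (encoded_value.bit_length() + 7) // 8
round_trip = encoded_value.to_bytes(n_bytes, "big").decode("utf-8")

print(utf8_bytes.hex(), encoded_value, round_trip)
```

Both integers identify the same character; they differ only in which
numbering system the argument is assumed to come from.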

Also, I'm wondering what we should do with different backend/frontend
encoding combinations. For example, if your database is in UTF-8 and
your client encoding is LATIN2, what integer value should be passed
to chr()? The LATIN2 value or the Unicode code point?
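
A small Python sketch of why the question matters, assuming the
character U+0141 (L with stroke) as the example; the character choice
and variable names are only for illustration. The same character has a
different integer value in each numbering system.

```python
# One character, three candidate integers for a chr() argument.
ch = "\u0141"  # LATIN CAPITAL LETTER L WITH STROKE

# Unicode code point.
unicode_code_point = ord(ch)

# Single-byte value in LATIN2 (ISO 8859-2), the client encoding.
latin2_value = ch.encode("iso8859_2")[0]

# UTF-8 byte sequence read as a big-endian number, the server encoding.
utf8_value = int.from_bytes(ch.encode("utf-8"), "big")

print(unicode_code_point, latin2_value, utf8_value)
```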

> > Or are you going to represent a Unicode code point as a character
> > string such as 'U+0259'? Then representing any encoding as a string
> > could avoid endianness issues anyway, and I don't see that a Unicode
> > code point is any better than the others.
> >
>
> The argument will be a number, as now.
>
> > Also I'd like to point out that all encodings have their own code
> > point systems as far as I know. For example, EUC-JP has its
> > corresponding code point systems: ASCII, JIS X 0208 and JIS X 0212.
> > So I don't see why we can't use a "code point" as chr()'s argument
> > for other encodings (of course we need an optional parameter
> > specifying which character set is meant).
> >
>
> Where can I find the tables that map code points (as opposed to
> encodings) to characters for these others?

You mean the code point table of a character set? The actual standard
is not on the web since it is copyrighted by the Japanese government
(you need to buy it as a book or a PDF file). However, you can find
many code point tables on the web. For example, JIS X 0208 code points
can be found at:

http://www.infonet.co.jp/ueyama/ip/binary/x0208txt.html

(you need to have a Japanese font and set the page encoding to Shift JIS)

BTW, if you want to pass the "code point of a character set" rather
than the encoding value, you need to tell chr() which character set
you are referring to. So we need two arguments: one for the code
point, the other for the character set specification. What do you think?
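
A minimal sketch of such a two-argument function, written in Python
rather than as a PostgreSQL patch. The name chr_for_charset and the
charset labels "unicode" and "jisx0208" are hypothetical. The sketch
relies on the fact that EUC-JP encodes a JIS X 0208 code point by
setting the high bit of each of its two bytes.

```python
def chr_for_charset(code_point: int, charset: str) -> str:
    """Hypothetical two-argument chr(): a code point plus its character set."""
    if charset == "unicode":
        return chr(code_point)
    if charset == "jisx0208":
        # A JIS X 0208 code point is two 7-bit bytes, each in 0x21..0x7E.
        hi, lo = code_point >> 8, code_point & 0xFF
        if not (0x21 <= hi <= 0x7E and 0x21 <= lo <= 0x7E):
            raise ValueError(f"not a JIS X 0208 code point: {code_point:#x}")
        # EUC-JP sets the high bit of both bytes.
        return bytes([hi | 0x80, lo | 0x80]).decode("euc_jp")
    raise ValueError(f"unsupported character set: {charset}")


# The same kanji addressed through two different code point systems.
from_unicode = chr_for_charset(0x4E9C, "unicode")
from_jis = chr_for_charset(0x3021, "jisx0208")
print(from_unicode == from_jis)
```

Without the second argument, 0x3021 and 0x4E9C could not both be
understood as the same character.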
--
Tatsuo Ishii
SRA OSS, Inc. Japan
