Re: Java's Unicode Notation

From: Patrice Hédé <phede-ml(at)islande(dot)org>
To: Jean-Michel POURE <jm(dot)poure(at)freesurf(dot)fr>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Java's Unicode Notation
Date: 2001-11-12 18:03:14
Message-ID: 20011112190314.A2495@idf.net
Lists: pgsql-hackers

Hi,

I'm replying to the original mail, since it contains the description itself.

* Jean-Michel POURE <jm(dot)poure(at)freesurf(dot)fr> [011107 22:04]:
> Dear all,
>
> Would it be possible to use the Java Unicode Notation to define UTF-8
> strings in PostgreSQL 7.2?
> Information can be found on http://czyborra.com/utf/
>
> Best regards,
> Jean-Michel POURE
>
> ************************************************
>
> Java's Unicode Notation
> There are some less compact but more readable ASCII transformations
> the most important of which is the Java Unicode Notation as allowed
> in Java source code and processed by Java's native2ascii converter:
>
> putwchar(c)
> {
>   if (c >= 0x10000) {
>     printf ("\\u%04x\\u%04x", 0xD7C0 + (c >> 10), 0xDC00 | (c & 0x3FF));
>   }
>   else if (c >= 0x100) printf ("\\u%04x", c);
>   else putchar (c);
> }
>
> The advantage of the \u20ac notation is that it is very easy to type
> it in on any old ASCII keyboard and easy to look up the intended
> character if you happen to have a copy of the Unicode book or the
> {unidata2,names2,unihan}.txt files from the Unicode FTP site or
> CD-ROM or know that U+20AC is the €.
                                   ^^^
Was that the Euro codepoint from the proprietary Windows charset,
disguised in a mail advertising itself as "iso-8859-1", which doesn't
have the euro sign? ;)

[No wonder Unicode is really needed in Europe!]

> What's not so nice about the \u20ac notation is that the small
> letters are quite unusual for Unicode characters, the backslashes
> have to be quoted for many Unix tools, the four hexdigits without a
> terminator may appear merged with the following word as in \u00a333
> for £33, it is unclear when and how you have to escape the backslash
> character itself, 6 bytes for one character may be considered
> wasteful, and there is no way to clearly present the characters
> beyond \uffff without \ud800\udc00 surrogates, and last but not
> least the plain hexnumbers may not be very helpful.
>
> JAVA is one of the target and source encodings of yudit and its
> uniconv converter.

I have to disagree about this feature... well, not about the idea, but
the implementation.

First, the use of surrogates to describe codepoints above 0xffff.
Surrogates do NOT stand for characters of their own. They exist only
for the UTF-16 encoding, which is the encoding used by Java and
Windows. PostgreSQL, however, like most Unix tools, uses UTF-8.

Encoding a codepoint above 0xffff as two surrogates in UTF-8 is
illegal: UTF-8 encodes the codepoint itself. So you should forget
about surrogates here, as they are an unnatural extra step.
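
To illustrate the point about surrogates, here is a minimal C sketch,
not anything taken from PostgreSQL or Java. The function names
combine_surrogates and utf8_encode are hypothetical, and the sketch
assumes the current definition of UTF-8 (at most four bytes, nothing
above 0x10FFFF). A surrogate pair is first turned back into the
codepoint it stands for, and only that codepoint is encoded; a lone
surrogate is rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Combine a UTF-16 surrogate pair into the codepoint it stands for.
   Returns 0 when the inputs are not a valid high/low pair
   (a valid pair always yields at least 0x10000). */
uint32_t combine_surrogates(uint32_t hi, uint32_t lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0;
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}

/* Encode one codepoint as UTF-8 into out (room for 4 bytes needed).
   Returns the number of bytes written, or 0 for surrogates and for
   values above 0x10FFFF, which have no valid UTF-8 form. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogates are not encodable */
    if (cp > 0x10FFFF)
        return 0;                       /* outside the Unicode range */
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

With these definitions, input written as \ud800\udc00 would go through
combine_surrogates first, and the resulting 0x10000 would be encoded
as one four-byte sequence rather than as two three-byte ones.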

I've seen somewhere the notation \v010000 (using \v for 6-digit
codepoints). But I don't like it too much either.

I agree with your idea of being able to express Unicode codepoints
directly with escape sequences. I personally like Perl's solution:

\x{20ac}
\x{010123}
\x{7e}

The braces make the length of the codepoint unambiguous (I have often
typed one "0" too many or too few myself in Unicode codepoint
descriptions).

I don't mind \u{...} instead of \x{...}. But a lot of PostgreSQL users
would be familiar with \x{} notation :) [Me being the first one]
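
As a rough sketch of how such a braced escape could be read (this is
an illustration only; parse_braced_escape is a hypothetical name, not
an existing PostgreSQL or psql function), the closing brace tells the
parser exactly where the codepoint ends, so \x{a3}33 cannot be
misread the way \u00a333 can:

```c
#include <stddef.h>
#include <stdint.h>

/* Parse an escape of the form \x{HEX} at the start of s.
   On success, store the codepoint in *cp and return the number of
   characters consumed. Return 0 if s does not start with a
   well-formed escape or the value exceeds 0x10FFFF. */
size_t parse_braced_escape(const char *s, uint32_t *cp)
{
    uint32_t value = 0;
    size_t i = 3;        /* index of the first character after "\x{" */
    size_t digits = 0;

    if (s[0] != '\\' || s[1] != 'x' || s[2] != '{')
        return 0;

    for (; s[i] != '\0' && s[i] != '}'; i++) {
        char ch = s[i];
        int d;

        if (ch >= '0' && ch <= '9')
            d = ch - '0';
        else if (ch >= 'a' && ch <= 'f')
            d = ch - 'a' + 10;
        else if (ch >= 'A' && ch <= 'F')
            d = ch - 'A' + 10;
        else
            return 0;                   /* not a hex digit */

        value = (value << 4) | (uint32_t)d;
        if (value > 0x10FFFF)
            return 0;                   /* beyond the Unicode range */
        digits++;
    }

    if (s[i] != '}' || digits == 0)
        return 0;                       /* missing brace or empty braces */

    *cp = value;
    return i + 1;                       /* include the closing brace */
}
```

Leading zeros are accepted, so \x{7e} and \x{00007e} give the same
codepoint, and anything following the closing brace is left for the
caller to treat as ordinary text.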

I think that this is something for psql, however. Where is "\n"
translated, for example? Anyway, for 7.3... :)

Patrice.

--
Patrice Hédé
email: patrice hede à islande org
www : http://www.islande.org/
