Re: UTF8 national character data type support WIP patch and list of open issues.

From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: robertmhaas(at)gmail(dot)com
Cc: maumau307(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, maksymb(at)fast(dot)au(dot)fujitsu(dot)com, hlinnakangas(at)vmware(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: UTF8 national character data type support WIP patch and list of open issues.
Date: 2013-09-19 23:58:53
Message-ID: 20130920.085853.1628917054830864151.t-ishii@sraoss.co.jp
Lists: pgsql-hackers

> On Mon, Sep 16, 2013 at 8:49 AM, MauMau <maumau307(at)gmail(dot)com> wrote:
>> 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always
>> contain Unicode data.
> ...
>> 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns.
>> Fixed-width encoding may allow faster string manipulation as described in
>> Oracle's manual. But I'm not sure about this, because UTF-16 is not a real
>> fixed-width encoding due to supplementary characters.
>
> It seems to me that these two points here are the real core of your
> proposal. The rest is just syntactic sugar.
>
> Let me start with the second one: I don't think there's likely to be
> any benefit in using UTF-16 as the internal encoding. In fact, I
> think it's likely to make things quite a bit more complicated, because
> we have a lot of code that assumes that server encodings have certain
> properties that UTF-16 doesn't - specifically, that any byte with the
> high-bit clear represents the corresponding ASCII character.

Agreed.
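
The property quoted above can be checked with a few lines of Python. This is only an illustrative sketch, not PostgreSQL code: the helper name `ascii_looking_bytes` is made up here, and U+4142 is chosen simply because its UTF-16 bytes happen to fall in the ASCII range. The last two lines also illustrate the earlier remark that UTF-16 is not truly fixed-width.

```python
def ascii_looking_bytes(text: str, encoding: str) -> bytes:
    """Return the encoded bytes of `text` whose high bit is clear."""
    return bytes(b for b in text.encode(encoding) if b < 0x80)


ch = "\u4142"  # a single CJK ideograph, not an ASCII character

# UTF-8: every byte of a multibyte character has the high bit set,
# so no byte can be mistaken for an ASCII character.
utf8_hits = ascii_looking_bytes(ch, "utf-8")

# UTF-16 (big-endian, no BOM): both bytes have the high bit clear.
utf16_hits = ascii_looking_bytes(ch, "utf-16-be")

# A BMP character takes one 16-bit unit; a supplementary character
# needs a surrogate pair, so the width is not fixed.
bmp_len = len("A".encode("utf-16-be"))
supp_len = len("\U0001F600".encode("utf-16-be"))

print(utf8_hits, utf16_hits, bmp_len, supp_len)
```

In UTF-16 the ideograph is stored as the bytes 0x41 0x42, which byte-oriented code that assumes the high-bit property would misread as the ASCII text "AB".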

> As to the first one, if we're going to go to the (substantial) trouble
> of building infrastructure to allow a database to store data in
> multiple encodings, why limit it to storing UTF-8 in non-UTF-8
> databases? What about storing SHIFT-JIS in UTF-8 databases, or
> Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other
> combination you might care to name?
>
> Whether we go that way or not, I think storing data in one encoding in
> a database with a different encoding is going to be pretty tricky and
> require far-reaching changes. You haven't mentioned any of those
> issues or discussed how you would solve them.

What about limiting the use of NCHAR to databases that have the same
encoding or a "compatible" encoding (one for which an encoding
conversion is defined)? This way, NCHAR text can be automatically
converted from the NCHAR encoding to the database encoding on the
server side, so we can treat NCHAR exactly the same as CHAR afterward.
I suppose the encoding used for NCHAR should be defined at initdb time
or at database creation (if we allow this, we need to add a new column
to record which encoding is used for NCHAR).
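
A rough sketch of the server-side conversion step described above, written in Python purely for illustration. The function name `to_database_encoding` is hypothetical, and the encoding names are Python codec names, not PostgreSQL encoding names; nothing here reflects how the server would actually implement it.

```python
def to_database_encoding(raw: bytes, nchar_encoding: str, db_encoding: str) -> bytes:
    """Convert an incoming NCHAR value to the database encoding.

    Assumption: once converted, the value is handled like ordinary CHAR data.
    """
    if nchar_encoding == db_encoding:
        # Same encoding: nothing to convert. (A real server would
        # presumably still validate the bytes; this sketch does not.)
        return raw
    # Decode from the NCHAR encoding, then re-encode for the database.
    return raw.decode(nchar_encoding).encode(db_encoding)


# An NCHAR value arriving as SHIFT-JIS, stored in a UTF-8 database.
sjis_value = "日本語".encode("shift_jis")
stored = to_database_encoding(sjis_value, "shift_jis", "utf-8")
print(stored.decode("utf-8"))
```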

For example, "CREATE TABLE t1(t NCHAR(10))" will succeed if NCHAR is
UTF-8 and the database encoding is UTF-8. It will even succeed if NCHAR
is SHIFT-JIS and the database encoding is UTF-8, because there is a
conversion between UTF-8 and SHIFT-JIS. However, it will not succeed if
NCHAR is SHIFT-JIS and the database encoding is ISO-8859-1, because
there's no conversion between them.
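
The three cases above can be sketched as a simple compatibility check. This is a minimal Python illustration under stated assumptions: the `CONVERSIONS` set is a small hand-written table for this example (not the server's actual list of conversions), the function name `nchar_allowed` is hypothetical, and requiring a conversion in both directions is an assumption of this sketch.

```python
# Hypothetical, hand-written conversion table: (source, destination) pairs.
# It only covers the encodings mentioned in the example above.
CONVERSIONS = {
    ("SJIS", "UTF8"), ("UTF8", "SJIS"),
    ("LATIN1", "UTF8"), ("UTF8", "LATIN1"),
}


def nchar_allowed(nchar_encoding: str, db_encoding: str,
                  conversions=CONVERSIONS) -> bool:
    """Return True if an NCHAR column could be created in this database."""
    if nchar_encoding == db_encoding:
        return True  # same encoding: no conversion needed
    # "Compatible" is taken here to mean a conversion exists both ways.
    return ((nchar_encoding, db_encoding) in conversions
            and (db_encoding, nchar_encoding) in conversions)


for nchar_enc, db_enc in [("UTF8", "UTF8"), ("SJIS", "UTF8"), ("SJIS", "LATIN1")]:
    print(nchar_enc, db_enc, nchar_allowed(nchar_enc, db_enc))
```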
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
