RE: PostgreSQL and Unicode

From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: rmager(at)vgkk(dot)co(dot)jp
Cc: ishii(at)postgresql(dot)org, hackers(at)postgresql(dot)org
Subject: RE: PostgreSQL and Unicode
Date: 2000-05-16 07:08:55
Message-ID: 20000516160855E.t-ishii@sra.co.jp
Lists: pgsql-hackers

> My understanding of the problem with UTF8 is this. Functionally, it is
> equivalent to UCS-2; that is, you can encode any Unicode character in UTF-8
> that you could encode in UCS-2.
> The problem we've run into is only related to Postgres. For example, we had
> a field that was fixed at 20 characters. If we put in ASCII then we could
> put in all 20 characters. If we put in UTF8-encoded Japanese then (depending
> on which characters were used) we got about 3 UTF8 bytes for each
> Japanese character. Aside from going from 20 characters to 7 (*problem #1*),
> we also now have unpredictable behavior. Some characters, like Japanese,
> had a 3:1 ratio when encoded. UTF8 can go as high as a 6:1 encoding ratio for
> some languages (I don't know which offhand); this is *problem #2*. Finally,
> as a side effect of this, the string was just truncated, so we sometimes got
> only a partial UTF8 character in the database. This made the decoding
> either fail or produce weird results (*problem #3*).

Yes, I have noticed this problem too. But don't we have the same problem
with UCS-2, with a 2:1 ratio, then? I think we should fix it this
way:
char(10) should mean 10 letters, not 10 bytes, no matter what
encoding we use.
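
To make the difference concrete, here is a minimal sketch in Python (not
PostgreSQL code; the helper names truncate_bytes and truncate_chars are
hypothetical and only for illustration). It assumes that "letters" can be
approximated by Unicode code points, which ignores combining sequences.
It compares cutting a string at a byte limit with cutting it at a
character limit.

```python
def truncate_bytes(text: str, limit: int, encoding: str = "utf-8") -> bytes:
    """Byte-based truncation: may cut a multibyte character in half."""
    return text.encode(encoding)[:limit]


def truncate_chars(text: str, limit: int, encoding: str = "utf-8") -> bytes:
    """Character-based truncation: keeps at most `limit` whole code points."""
    return text[:limit].encode(encoding)


sample = "日本語のテキスト"  # 8 Japanese characters

utf8 = sample.encode("utf-8")      # 3 bytes per character here
ucs2 = sample.encode("utf-16-be")  # 2 bytes per character (same as UCS-2 for these)

print(len(sample), len(utf8), len(ucs2))

# A 5-byte limit leaves one whole character plus a 2-byte fragment.
broken = truncate_bytes(sample, 5)
try:
    broken.decode("utf-8")
except UnicodeDecodeError as exc:
    print("byte limit produced invalid UTF-8:", exc.reason)

# A 5-character limit always yields complete characters.
whole = truncate_chars(sample, 5)
print(whole.decode("utf-8"), len(whole), "bytes")
```

With a character limit the stored byte length varies with the encoding,
but the value never ends in a partial character.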

I will tackle this problem for 7.1.

What do you think, Rainer? Are you still unhappy with the solution
above?
--
Tatsuo Ishii
