From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: rmager(at)vgkk(dot)co(dot)jp
Cc: ishii(at)postgresql(dot)org, hackers(at)postgresql(dot)org
Subject: RE: PostgreSQL and Unicode
Date: 2000-05-16 07:08:55
Message-ID: 20000516160855E.t-ishii@sra.co.jp
Lists: pgsql-hackers
> My understanding of the problem with UTF-8 is this. Functionally, it is
> equivalent to UCS-2; any Unicode character you can encode in UCS-2 you
> can also encode in UTF-8.
> The problem we've run into is specific to Postgres. For example, we had
> a field fixed at 20 characters. If we put in ASCII, we could fit all 20
> characters. If we put in UTF-8 encoded Japanese, then (depending on
> which characters were used) we got about 3 UTF-8 bytes for each
> Japanese character. Aside from going from 20 characters down to about 7
> (*problem #1*), we also now have unpredictable behavior: some scripts,
> like Japanese, encode at a 3:1 ratio, and UTF-8 can go as high as a 6:1
> ratio for some languages (I don't know which offhand); this is
> *problem #2*. Finally, as a side effect of this, the string was simply
> truncated, so we sometimes got only a partial UTF-8 character in the
> database. This made decoding either fail or produce weird results
> (*problem #3*).
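The three problems quoted above can be reproduced outside Postgres. This is a minimal illustrative sketch (not PostgreSQL code), using Python's byte-level string handling and a sample Japanese string of my own choosing:

```python
# Each of these 7 Japanese characters encodes to 3 bytes in UTF-8.
s = "日本語テキスト"
b = s.encode("utf-8")
assert len(b) == 21          # 7 characters -> 21 bytes (3:1 ratio)

# Problem #1/#2: a 20-*byte* column holds only 6 whole characters here,
# and the ratio varies by script, so the capacity is unpredictable.
truncated = b[:20]           # Problem #3: this splits the last character

try:
    truncated.decode("utf-8")
except UnicodeDecodeError:
    print("partial UTF-8 character: decoding failed")
```

Truncating at a fixed byte count lands mid-sequence, and the leftover partial character makes the decoder fail, exactly as described.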
Yes, I have noticed this problem too. But wouldn't we have the same
problem with UCS-2, just with a 2:1 ratio? I think we should fix it
this way:
char(10) should mean 10 characters, not 10 bytes, no matter what
encoding we use.
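The rule above can be sketched as: decode, count characters, re-encode. A minimal illustration (my own hypothetical helper, not the server-side implementation):

```python
def truncate_chars(raw: bytes, n: int, encoding: str = "utf-8") -> bytes:
    """Keep at most n *characters*, never splitting a multibyte sequence."""
    return raw.decode(encoding)[:n].encode(encoding)

data = "日本語テキスト".encode("utf-8")   # 7 characters, 21 bytes

assert truncate_chars(data, 10) == data    # all 7 characters fit in char(10)
# Truncation always falls on a character boundary, so the result decodes:
assert len(truncate_chars(data, 3).decode("utf-8")) == 3
```

With character-based semantics, char(10) holds the same number of letters regardless of encoding, and truncation can never leave a partial character behind.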
I will tackle this problem for 7.1.
What do you think, Rainer? Are you still unhappy with the solution
above?
--
Tatsuo Ishii