Re: Unicode restriction

From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: olly(at)lfix(dot)co(dot)uk
Cc: pgsql-hackers(at)postgresql(dot)org, 232217(at)bugs(dot)debian(dot)org
Subject: Re: Unicode restriction
Date: 2004-08-03 12:59:33
Message-ID: 20040803.215933.78705094.t-ishii@sra.co.jp
Lists: pgsql-hackers

> In src/backend/utils/mb/wchar.c there is a check to exclude Unicode
> characters above 0x10000. I can't see anything to explain this
> restriction, except possibly this in the release notes for 7.2:
>
> Reject invalid multibyte character sequences (Tatsuo)
>
> It does not explain why part of the Unicode character range is invalid.
> There is a Debian bug report from someone whose client is trying to
> store characters in the excluded range. What would be needed to enable
> support for it?

Before 7.4, UTF-8 was converted to ISO 10646 so that the regex routines
could handle it. The regex routines had a limitation: they could not
handle characters wider than 2 bytes. In other words, only 16-bit UCS-2
was supported. That's why ISO 10646 code points beyond the 16-bit range
(0x10000 and above) are rejected.
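
To make the 16-bit limit concrete, here is a minimal Python sketch of the idea. It is not the code in wchar.c; the function names are hypothetical and chosen only for illustration. It relies on one property of UTF-8: a code point needs a 4-byte sequence exactly when it lies at 0x10000 or above, which is the range that does not fit in UCS-2.

```python
from typing import Optional


def utf8_sequence_length(lead_byte: int) -> int:
    """Return the UTF-8 sequence length implied by a lead byte (hypothetical helper)."""
    if lead_byte < 0x80:
        return 1  # 0xxxxxxx: ASCII
    if (lead_byte & 0xE0) == 0xC0:
        return 2  # 110xxxxx: up to U+07FF
    if (lead_byte & 0xF0) == 0xE0:
        return 3  # 1110xxxx: up to U+FFFF, still fits in 16 bits
    if (lead_byte & 0xF8) == 0xF0:
        return 4  # 11110xxx: U+10000 and above, does not fit in UCS-2
    raise ValueError(f"not a valid UTF-8 lead byte: 0x{lead_byte:02x}")


def first_non_ucs2_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first 4-byte sequence, or None if there is none.

    Only lead bytes are inspected; continuation bytes are not validated.
    """
    i = 0
    while i < len(data):
        length = utf8_sequence_length(data[i])
        if length == 4:
            return i
        i += length
    return None


def fits_in_ucs2(text: str) -> bool:
    """True if every code point in text can be stored in 16 bits."""
    return all(ord(ch) <= 0xFFFF for ch in text)
```

A check of this kind would accept "a" and the euro sign (1 and 3 bytes) but flag a character such as U+1D11E, which needs 4 bytes in UTF-8.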

I'm not sure whether the regex routines included in 7.4 or later have
this restriction. If not, we could probably remove the check (at the
cost of losing data compatibility).
--
Tatsuo Ishii
