Re: [HACKERS] Unicode combining characters

From: Patrice Hédé <phede-ml(at)islande(dot)org>
To: pgsql-patches(at)postgresql(dot)org
Subject: Re: [HACKERS] Unicode combining characters
Date: 2001-10-09 17:07:38
Message-ID: 20011009190738.H14587@idf.net
Lists: pgsql-hackers pgsql-patches

* Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp> [011009 18:38]:
> > - corrects a bit the UTF-8 code from Tatsuo to allow Unicode 3.1
> > characters (characters with values >= 0x10000, which are encoded on
> > four bytes).
>
> After applying your patches, does pg_utf2wchar_with_len() convert
> 4-byte UTF-8 to UCS-2 (2 bytes) or UCS-4 (4 bytes)? If it is 4
> bytes, we are in trouble. The current regex implementation does not
> handle 4-byte-wide charsets.

*sigh* yes, it does encode to four bytes :(
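
To make the problem concrete, here is a minimal sketch of what decoding
a four-byte sequence amounts to. This is not the actual
pg_utf2wchar_with_len() code; the helper name decode_utf8_4byte is
hypothetical and it skips most validation. The point is only that the
result needs more than 16 bits:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: decode one 4-byte UTF-8
 * sequence (lead byte 11110xxx followed by three 10xxxxxx bytes).
 * Continuation bytes are not validated here.
 */
static uint32_t
decode_utf8_4byte(const unsigned char *s)
{
    /* the caller must pass a 4-byte lead byte (0xF0..0xF7) */
    assert((s[0] & 0xF8) == 0xF0);

    return ((uint32_t) (s[0] & 0x07) << 18) |
           ((uint32_t) (s[1] & 0x3F) << 12) |
           ((uint32_t) (s[2] & 0x3F) << 6) |
            (uint32_t) (s[3] & 0x3F);
}

/* True when a code point fits in a 16-bit (UCS-2 style) unit. */
static int
fits_in_16_bits(uint32_t cp)
{
    return cp <= 0xFFFF;
}
```

For example, U+10330 (GOTHIC LETTER AHSA) is encoded as F0 90 8C B0;
the helper above returns 0x10330 for it, which fits_in_16_bits()
rejects.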

Three solutions, then:

1) we support these supplementary characters, knowing that they won't
work with regexes,

2) I back out the change, but then anyone using these characters will
get something weird, since the decoding would be faulty (the first
three bytes would be handled as a 3-byte UTF-8 char, and the fourth
byte would then become a "faulty char"... not very good, as the
3-byte prefix is not a valid UTF-8 sequence either!),

3) we fix the regex engine within the next 24 hours, before the beta
deadline takes effect :/
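
To illustrate what goes wrong under solution 2), here is a rough sketch
of a decoder that only knows about 1- to 3-byte sequences. It is an
assumption about the general shape of such code, not a copy of the
pre-patch source; the name decode_upto_3byte is hypothetical. The
3-byte test is deliberately loose, so a 4-byte lead byte also matches
it:

```c
#include <stdint.h>

/*
 * Hypothetical decoder limited to 1..3-byte UTF-8 sequences.
 * Stores the decoded value in *out and returns the number of
 * bytes consumed. Illustration only, no validation.
 */
static int
decode_upto_3byte(const unsigned char *s, uint32_t *out)
{
    if ((s[0] & 0x80) == 0x00)
    {
        *out = s[0];            /* plain ASCII */
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0)
    {
        *out = ((uint32_t) (s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xE0) == 0xE0)
    {
        /* also matches 4-byte lead bytes 0xF0..0xF7: the bug */
        *out = ((uint32_t) (s[0] & 0x0F) << 12) |
               ((uint32_t) (s[1] & 0x3F) << 6) |
                (uint32_t) (s[2] & 0x3F);
        return 3;
    }
    *out = s[0];                /* stray byte passed through as-is */
    return 1;
}
```

Fed F0 90 8C B0 (U+10330), this consumes three bytes and yields
0x040C, an unrelated Cyrillic letter, and the next call returns the
leftover 0xB0 as a character of its own.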

I must say that I doubt that anyone will use these characters in the
next few months: these are mostly Chinese extended characters, along
with the Old Italic, Deseret, and Gothic scripts, Byzantine and Western
musical symbols, and the mathematical alphanumeric symbols.

I would prefer solution 1), as I think it is better to allow these
characters, even with a temporary restriction on regexes, than to
fail completely on them. As for solution 3), we can still work on it
over the next few months :) [I haven't even looked at the regex engine
yet, so I don't know the implications of what I have just said!]

What do you think?

Patrice

--
Patrice Hédé
email: patrice hede à islande org
www : http://www.islande.org/
