Re: [PATCHES] UNICODE characters above 0x10000

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
Cc: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>, john(at)geeknet(dot)com(dot)au, pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject: Re: [PATCHES] UNICODE characters above 0x10000
Date: 2004-08-07 16:43:20
Message-ID: 350.1091897000@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-patches

Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org> writes:
> On Sat, 7 Aug 2004, Tatsuo Ishii wrote:
>> Anyway my point is if current specification of Unicode only allows
>> 24-bit range, why we need to allow usage against the specification?

> Is there a specific reason you want to restrict it to 24 bits?

I see several places that have to allocate space on the basis of the
maximum encoded character length possible in the current encoding
(look for uses of pg_database_encoding_max_length). Probably the only
one that's really significant for performance is text_substr(), but
that's enough to be an argument against setting maxmblen higher than
we have to.

It looks to me like supporting 4-byte UTF-8 characters would be enough
to handle the existing range of Unicode codepoints, and that is probably
as much as we want to do.

If I understood what I was reading, this would take several things:
* Remove the "special UTF-8 check" in pg_verifymbstr;
* Extend pg_utf2wchar_with_len and pg_utf_mblen to handle the 4-byte case;
* Set maxmblen to 4 in the pg_wchar_table[] entry for UTF-8.

Are there any other places that would have to change? Would this break
anything? The testing aspect is what's bothering me at the moment.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2004-08-07 16:59:17 Re: Vacuum Cost Documentation?
Previous Message Bernd Helmle 2004-08-07 15:34:54 Backend crashes with notification rule

Browse pgsql-patches by date

  From Date Subject
Next Message Tom Lane 2004-08-07 16:53:49 Re: PITR on Win32 - Archive and Restore Command Strings
Previous Message John Hansen 2004-08-07 13:40:36 Re: [PATCHES] UNICODE characters above 0x10000