Re: Unicode support

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, - - <crossroads0000(at)googlemail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unicode support
Date: 2009-04-13 21:04:17
Message-ID: 49E3A8D1.9010607@dunslane.net
Lists: pgsql-hackers

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> This isn't about the number of bytes, but about whether or not we should
>> count characters encoded as two or more combined code points as a single
>> char or not.
>
> It's really about whether we should support non-canonical encodings.
> AFAIK that's a hack to cope with implementations that are restricted
> to UTF-16, and we should Just Say No. Clients that are sending these
> things converted to UTF-8 are in violation of the standard.
>

I don't believe that the standard forbids the use of combining chars at
all. RFC 3629 says:

   Security may also be impacted by a characteristic of several
   character encodings, including UTF-8: the "same thing" (as far as a
   user can tell) can be represented by several distinct character
   sequences. For instance, an e with acute accent can be represented
   by the precomposed U+00E9 E ACUTE character or by the canonically
   equivalent sequence U+0065 U+0301 (E + COMBINING ACUTE). Even though
   UTF-8 provides a single byte sequence for each character sequence,
   the existence of multiple character sequences for "the same thing"
   may have security consequences whenever string matching, indexing,
   searching, sorting, regular expression matching and selection are
   involved. An example would be string matching of an identifier
   appearing in a credential and in access control list entries. This
   issue is amenable to solutions based on Unicode Normalization Forms,
   see [UAX15].
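
To make the RFC's example concrete, here is a minimal sketch in Python (not PostgreSQL code), assuming the standard unicodedata module. The helper name nfc_equal is made up for illustration. It shows that the two spellings of "e with acute" differ in code point count and in UTF-8 byte count, yet compare equal once both are normalized to NFC:

```python
import unicodedata

# The two canonically equivalent spellings from the RFC 3629 excerpt.
precomposed = "\u00e9"    # U+00E9, a single precomposed code point
decomposed = "e\u0301"    # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def nfc_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A plain code point comparison treats the two spellings as different.
print(precomposed == decomposed)             # False

# Code point counts differ: 1 versus 2.
print(len(precomposed), len(decomposed))     # 1 2

# UTF-8 byte counts differ as well: 2 versus 3.
print(len(precomposed.encode("utf-8")),
      len(decomposed.encode("utf-8")))       # 2 3

# After normalization the two spellings match.
print(nfc_equal(precomposed, decomposed))    # True
```

Both strings are valid UTF-8. Whether a length function should report 1 or 2 for the decomposed form is the question under discussion here.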

cheers

andrew
