Re: Unicode support

From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: peter_e(at)gmx(dot)net (Peter Eisentraut), Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unicode support
Date: 2009-04-14 23:00:32
Message-ID: 87r5zualin.fsf@news-spur.riddles.org.uk
Lists: pgsql-hackers

>>>>> "Peter" == Peter Eisentraut <peter_e(at)gmx(dot)net> writes:

> On Tuesday 14 April 2009 07:07:27 Andrew Gierth wrote:
>> FWIW, the SQL spec puts the onus of normalization squarely on the
>> application; the database is allowed to assume that Unicode
>> strings are already normalized, is allowed to behave in
>> implementation-defined ways when presented with strings that
>> aren't normalized, and provision of normalization functions and
>> predicates is just another optional feature.

Peter> Can you name chapter and verse on that?

4.2.8 Universal character sets

A UCS string is a character string whose character repertoire is UCS
and whose character encoding form is one of UTF8, UTF16, or
UTF32. Any two UCS strings are comparable.

An SQL-implementation may assume that all UCS strings are normalized
in one of Normalization Form C (NFC), Normalization Form D (NFD),
Normalization Form KC (NFKC), or Normalization Form KD (NFKD), as
specified by [Unicode15]. <normalized predicate> may be used to
verify the normalization form to which a particular UCS string
conforms. Applications may also use <normalize function> to enforce
a particular <normal form>. With the exception of <normalize function>
and <normalized predicate>, the result of any operation on an
unnormalized UCS string is implementation-defined.

Conversion of UCS strings from one character set to another is
automatic.

Detection of a noncharacter in a UCS-string causes an exception
condition to be raised. The detection of an unassigned code point
does not.

[Obviously there are things here that we don't conform to anyway (we
don't raise exceptions for noncharacters, for example). We don't claim
conformance to T061.]

<normalized predicate> ::=
<row value predicand> <normalized predicate part 2>
<normalized predicate part 2> ::=
IS [ NOT ] [ <normal form> ] NORMALIZED

1) Without Feature T061, "UCS support", conforming SQL language shall
not contain a <normalized predicate>.

2) Without Feature F394, "Optional normal form specification",
conforming SQL language shall not contain <normal form>.

<normalize function> ::=
NORMALIZE <left paren> <character value expression>
[ <comma> <normal form> [ <comma> <normalize function result length> ] ] <right paren>

<normal form> ::=
NFC
| NFD
| NFKC
| NFKD

7) Without Feature T061, "UCS support", conforming SQL language shall
not contain a <normalize function>.

9) Without Feature F394, "Optional normal form specification",
conforming SQL language shall not contain <normal form>.
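
To make the semantics of <normalize function> and <normalized predicate>
concrete, here is a minimal sketch in Python using the standard
unicodedata module. The helper names normalize() and is_normalized()
are made up for illustration; this is not PostgreSQL code and not
anything the spec itself defines beyond the four <normal form> names.

```python
import unicodedata

# The four <normal form> keywords from the grammar quoted above.
NORMAL_FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize(s: str, form: str = "NFC") -> str:
    """Rough analogue of NORMALIZE(s, form); hypothetical helper."""
    if form not in NORMAL_FORMS:
        raise ValueError(f"unknown normal form: {form}")
    return unicodedata.normalize(form, s)


def is_normalized(s: str, form: str = "NFC") -> bool:
    """Rough analogue of `s IS form NORMALIZED`; hypothetical helper.

    A string is in a given form if normalizing it changes nothing.
    """
    return normalize(s, form) == s


decomposed = "e\u0301"            # 'e' followed by COMBINING ACUTE ACCENT
composed = normalize(decomposed)  # NFC composes the pair into U+00E9
```

Here the decomposed string is already in NFD but not in NFC, which is
the sort of input whose handling the spec leaves implementation-defined.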

Peter> I see this, for example,

Peter> 6.27 <numeric value function>
[...]
Peter> So SQL redirects the question of character length to the Unicode
Peter> standard. I have not been able to find anything there on a
Peter> quick look, but I'm sure the Unicode standard has some very
Peter> specific ideas on this. Note that the matter of normalization
Peter> is not mentioned here.

I've taken a not-so-quick look at the Unicode standard (though I don't
claim to be any sort of expert on it), and I certainly can't see any
definitive indication of what the length is supposed to be; however, the
use of terminology such as "combining character sequence" (meaning a
series of codepoints that combine to make a single glyph) certainly
seems to strongly imply that our interpretation is correct and that
the OP's is not.

Other indications: the units used by length() must be the same as the
units used by position() and substring() (in the spec, when USING
CHARACTERS is specified), and it would not make sense to use a
definition of "character" that did not allow you to look inside a
combining sequence.
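
The point about shared units can be sketched as follows, in Python,
assuming the unit is the code point. The helpers char_length(),
position() and substring() are hypothetical stand-ins that mimic the
1-based SQL functions; they are not the PostgreSQL implementations.

```python
def char_length(s: str) -> int:
    # Python's len() on str counts code points.
    return len(s)


def position(needle: str, haystack: str) -> int:
    # 1-based like SQL POSITION; 0 when the needle is absent.
    return haystack.find(needle) + 1


def substring(s: str, start: int, length: int) -> str:
    # 1-based like SQL SUBSTRING; this sketch only handles start >= 1.
    return s[start - 1:start - 1 + length]


# "resume" with both accents written as combining marks: 8 code points.
s = "re\u0301sume\u0301"
acute = "\u0301"

# Because all three functions count code points, the offset returned by
# position() can be fed straight back into substring() to reach the
# combining mark inside the sequence.
mark = substring(s, position(acute, s), 1)
```

If length() counted combining sequences instead, position() could not
report an offset for the accent on its own without switching units.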

I've also failed so far to find any examples of other programming
languages in which a combining character sequence is taken to be a
single character for purposes of length or position specification.
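
For comparison, here is how the two interpretations differ in Python,
whose built-in len() counts code points. The sequence counter below is
only an approximation written for this example: it treats every code
point with a nonzero canonical combining class as part of the preceding
sequence, which is not full grapheme-cluster segmentation.

```python
import unicodedata


def count_code_points(s: str) -> int:
    # What len() reports for a Python str.
    return len(s)


def count_combining_sequences(s: str) -> int:
    # Approximation: count only code points that start a sequence,
    # i.e. those whose canonical combining class is 0. A string that
    # begins with a stray combining mark is undercounted by this sketch.
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


decomposed = "re\u0301sume\u0301"
composed = unicodedata.normalize("NFC", decomposed)
```

The two counts disagree for the decomposed string and agree once it has
been normalized to NFC, which is why the normalization question and the
length question keep coming up together.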

--
Andrew (irc:RhodiumToad)
