From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: multibyte-character aware support for function "downcase_truncate_identifier()" |
Date: | 2010-11-21 23:22:31 |
Message-ID: | 4CE9A9B7.1080707@dunslane.net |
Lists: | pgsql-hackers |
On 11/21/2010 06:09 PM, Robert Haas wrote:
> I think that's fair. It actually doesn't seem like it should be that
> hard if we knew that the server encoding were UTF8 - it's just a big
> translation table somewhere, no?
No, it's far more complex. See for example
<http://unicode.org/reports/tr21/tr21-3.html>, which says:
There are a number of complications to case mappings that occur once
the repertoire of characters is expanded beyond ASCII.
* Because of the inclusion of certain composite characters for
compatibility, such as 01F1 "Ǳ" /capital dz/, there is a
third case, called /titlecase/, which is used where the first
letter of a word is to be capitalized (e.g. Titlecase, vs.
UPPERCASE, or lowercase).
o For example, the title case of the example character is
01F2 "ǲ" /capital d with small z/.
* Case mappings may produce strings of different length than the
original.
o For example, the German character 00DF "ß" /small letter
sharp s/ expands when uppercased to the sequence of two
characters "SS". This also occurs where there is no
precomposed character corresponding to a case mapping,
such as with 0149 "ŉ" /latin small letter n preceded by
apostrophe/.
* Characters may also have different case mappings, depending on
the context.
o For example, 03A3 "Σ" /capital sigma/ lowercases to 03C3
"σ" /small sigma/ if it is followed by another letter,
but lowercases to 03C2 "ς" /small final sigma/ if it is not.
* Characters may have case mappings that depend on the locale.
o For example, in Turkish the letter 0049 "I" /capital
letter i/ lowercases to 0131 "ı" /small dotless i/.
* Case mappings are not, in general, reversible.
o For example, once the string "McGowan" has been
uppercased, lowercased or titlecased, the original
cannot be recovered by applying another uppercase,
lowercase, or titlecase operation.
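To make the points above concrete, here is a small sketch in Python (not the server's C code, and not a proposal for downcase_truncate_identifier()). It assumes Python 3.3 or later, whose built-in str methods apply the Unicode default, locale-independent case mappings; the variable names are made up for illustration.

```python
# Illustrative sketch only: Python's str methods use the Unicode default
# case mappings and ignore the process locale.

# Titlecase is a third case: U+01F1 titlecases to U+01F2, not to itself.
dz_title = "\u01F1".title()

# Case mapping can change the length of a string.
sharp_s_upper = "\u00DF".upper()   # sharp s becomes two characters
n_apos_upper = "\u0149".upper()    # no precomposed uppercase exists

# Case mapping can depend on context: capital sigma at the end of a word
# lowercases to final sigma, elsewhere to ordinary small sigma.
final_sigma = "\u039F\u0394\u039F\u03A3".lower()
medial_sigma = "\u03A3\u0391".lower()

# Case mapping can depend on locale. The default mapping gives "i";
# a Turkish-aware routine would have to return U+0131 instead.
default_i = "I".lower()

# Case mapping is not reversible: the inner capital G is lost.
round_trip = "McGowan".upper().title()

print(len("\u00DF"), len(sharp_s_upper))
print(round_trip)
```

A single one-to-one translation table cannot express the length-changing, context-dependent, or locale-dependent cases shown here.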
cheers
andrew