Re: multibyte-character aware support for function "downcase_truncate_identifier()"

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-11-21 23:22:31
Message-ID: 4CE9A9B7.1080707@dunslane.net
Lists: pgsql-hackers

On 11/21/2010 06:09 PM, Robert Haas wrote:
> I think that's fair. It actually doesn't seem like it should be that
> hard if we knew that the server encoding were UTF8 - it's just a big
> translation table somewhere, no?

No, it's far more complex. See for example
<http://unicode.org/reports/tr21/tr21-3.html>, which says:

There are a number of complications to case mappings that occur once
the repertoire of characters is expanded beyond ASCII.

* Because of the inclusion of certain composite characters for
  compatibility, such as 01F1 "Ǳ" /capital dz/, there is a
  third case, called /titlecase/, which is used where the first
  letter of a word is to be capitalized (e.g. Titlecase, vs.
  UPPERCASE, or lowercase).
    o For example, the title case of the example character is
      01F2 "ǲ" /capital d with small z/.
* Case mappings may produce strings of different length than the
  original.
    o For example, the German character 00DF "ß" /small letter
      sharp s/ expands when uppercased to the sequence of two
      characters "SS". This also occurs where there is no
      precomposed character corresponding to a case mapping,
      such as with 0149 "ŉ" /latin small letter n preceded by
      apostrophe./
* Characters may also have different case mappings, depending on
  the context.
    o For example, 03A3 "Σ" /capital sigma/ lowercases to 03C3
      "σ" /small sigma/ if it is followed by another letter,
      but lowercases to 03C2 "ς" /small final sigma/ if it is not.
* Characters may have case mappings that depend on the locale.
    o For example, in Turkish the letter 0049 "I" /capital
      letter i/ lowercases to 0131 "ı" /small dotless i/.
* Case mappings are not, in general, reversible.
    o For example, once the string "McGowan" has been
      uppercased, lowercased or titlecased, the original
      cannot be recovered by applying another uppercase,
      lowercase, or titlecase operation.
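
As a rough illustration of the points quoted above, here is a minimal
sketch using Python's built-in str methods. It only shows how one
Unicode-aware implementation behaves; it says nothing about what
downcase_truncate_identifier() does or should do, and the variable
names are purely illustrative.

```python
# Each check mirrors one bullet from the quoted report. Characters are
# written as escapes so the snippet does not depend on file encoding.

# 1. Titlecase is a third case: U+01F1 has distinct lower and title forms.
DZ = "\u01f1"
titlecase_differs = (DZ.lower(), DZ.title()) == ("\u01f3", "\u01f2")

# 2. A case mapping can change the string length.
sharp_s_upper = "\u00df".upper()        # sharp s becomes two characters
n_apostrophe_upper = "\u0149".upper()   # no precomposed capital exists

# 3. The mapping can depend on context: capital sigma at the end of a
#    word versus capital sigma followed by another letter.
final_sigma = "\u039f\u0394\u039f\u03a3".lower()
medial_sigma = "\u03a3\u039f\u03a6\u0399\u0391".lower()

# 4. The mapping can depend on locale. Python's str.lower() is
#    locale-independent, so it never yields the Turkish dotless i.
default_i = "I".lower()

# 5. Case mappings are not reversible.
round_trip = "McGowan".upper().title()

print(titlecase_differs)                                # True
print(sharp_s_upper, len(n_apostrophe_upper))           # SS 2
print(final_sigma.endswith("\u03c2"), medial_sigma.startswith("\u03c3"))
print(default_i == "\u0131", round_trip)                # False Mcgowan
```

None of these cases fits a table that maps one code point to exactly
one code point, independent of its neighbours and of the locale.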

cheers

andrew
