Re: Built-in CTYPE provider

From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: "Jeff Davis" <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2023-12-13 15:34:15
Message-ID: d26df384-2fa7-4f50-b703-b0b6706dbeff@manitou-mail.org
Lists: pgsql-hackers

Jeff Davis wrote:

> While "full" case mapping sounds more complex, there are actually
> very few cases to consider and they are covered in another (small)
> data file. That data file covers ~100 code points that convert to
> multiple code points when the case changes (e.g. "ß" -> "SS"), 7
> code points that have context-sensitive mappings, and three locales
> which have special conversions ("lt", "tr", and "az") for a few code
> points.

But there are CLDR mappings on top of that.

According to the Unicode FAQ

https://unicode.org/faq/casemap_charprop.html#5

Q: Does the default case mapping work for every language? What
about the default case folding?

[...]

To make case mapping language sensitive, the Unicode Standard
specifically allows implementations to tailor the mappings for
each language, but does not provide the necessary data. The file
SpecialCasing.txt is included in the Standard as a guide to a few
of the more important individual character mappings needed for
specific languages, notably the Greek script and the Turkic
languages. However, for most language-specific mappings and
tailoring, users should refer to CLDR and other resources.

In particular, "el" (Modern Greek) has case mapping rules that
ICU seems to implement, but "el" is missing from the list
("lt", "tr", and "az") you identified.
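
To illustrate the gap, here is a minimal Python sketch. Python's str.upper()
applies the default, untailored full case mappings, so it expands "ß" to "SS"
but keeps the tonos when uppercasing Greek. The helper strip_tonos_upper() is
hypothetical and only a rough approximation of what a Greek tailoring does
(dropping U+0301 after uppercasing); the real "el" rules handle more cases,
such as the dialytika, and this is not ICU's or CLDR's actual algorithm.

```python
import unicodedata

# Default (untailored) full case mapping, as applied by Python's str methods.
sharp_s_upper = "\u00df".upper()          # "ß": one code point maps to two

# Greek word "alpha" written with a tonos on the first letter (U+03AC).
word = "\u03ac\u03bb\u03c6\u03b1"
default_upper = word.upper()              # default mapping keeps the tonos


def strip_tonos_upper(s: str) -> str:
    """Hypothetical, simplified stand-in for an "el" uppercase tailoring.

    Uppercases with the default mapping, then removes the combining
    acute accent (U+0301) that NFD exposes. Assumption: this covers only
    the tonos, not the other rules of the real Greek tailoring.
    """
    decomposed = unicodedata.normalize("NFD", s.upper())
    stripped = "".join(ch for ch in decomposed if ch != "\u0301")
    return unicodedata.normalize("NFC", stripped)


tailored_upper = strip_tonos_upper(word)

print(sharp_s_upper)
print(default_upper)
print(tailored_upper)
```

The difference between default_upper and tailored_upper is the kind of
language-specific behavior that SpecialCasing.txt alone does not describe.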

The CLDR case mappings seem to be found in
https://github.com/unicode-org/cldr/tree/main/common/transforms
in the *-Lower.xml and *-Upper.xml files.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
