From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Built-in CTYPE provider
Date: 2023-12-05 23:46:06
Message-ID: ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel@j-davis.com
Lists: pgsql-hackers

CTYPE, which handles character classification and upper/lowercasing
behavior, may be simpler than it first appears. We may be able to get
a net decrease in complexity by just building in most (or perhaps all)
of the functionality.

Unicode offers relatively simple rules for CTYPE-like functionality
based on data files. There are a few exceptions and a few options,
which I'll address below.

(In contrast, collation varies a lot from locale to locale, and has a
lot more options and nuance than ctype.)

=== Proposal ===

Parse some Unicode data files into static lookup tables in .h files
(similar to what we already do for normalization) and provide
functions to perform the right lookups according to Unicode
recommendations[1][2]. Then expose the functionality as either a
specially-named locale for the libc provider, or as part of the
built-in collation provider which I previously proposed[3]. (Provided
patches don't expose the functionality yet; I'm looking for feedback
first.)
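
To give a sense of the shape this could take, here is a minimal sketch
of a generated .h table plus a lookup function. The type and function
names are hypothetical, not taken from the attached patches:

    #include <stdbool.h>
    #include <stdint.h>

    /* a generated table would be a sorted array of code point ranges */
    typedef struct
    {
        uint32_t    first;          /* first code point in range */
        uint32_t    last;           /* last code point in range */
    } pg_unicode_range;

    static const pg_unicode_range alpha_ranges[] =
    {
        {0x0041, 0x005A},           /* A-Z */
        {0x0061, 0x007A},           /* a-z */
        {0x00C0, 0x00D6},           /* a slice of Latin-1 letters */
        /* ... generated by the update-unicode build target ... */
    };

    /* membership test: binary search over the sorted ranges */
    static bool
    range_search(const pg_unicode_range *tbl, int n, uint32_t cp)
    {
        int         lo = 0;
        int         hi = n - 1;

        while (lo <= hi)
        {
            int         mid = (lo + hi) / 2;

            if (cp < tbl[mid].first)
                hi = mid - 1;
            else if (cp > tbl[mid].last)
                lo = mid + 1;
            else
                return true;
        }
        return false;
    }

Whatever representation is ultimately used, the lookups are plain
static data compiled into the server, so they can't drift within a
major version.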

Using libc or ICU for a CTYPE provider would still be supported, but
as I explain below, there's not nearly as much reason to do so as you
might expect. As far as I can tell, using an external provider for
CTYPE functionality is mostly unnecessary complexity and magic.

There's still plenty of reason to use the plain "C" semantics, if
desired, but those semantics are already built-in.

=== Benefits ===

* platform-independent ctype semantics based on Unicode, not tied to
any dependency's implementation
* ability to combine fast memcmp() collation with rich ctype
semantics
* user-visible semantics can be documented and tested
* stability within a PG major version
* transparency of changes: tables would be checked in to .h files,
so whoever runs the "update-unicode" build target would see if
there are unexpected or impactful changes that should be addressed
in the release notes
* the built-in tables themselves can be tested exhaustively by
comparing with ICU so we can detect trivial parsing errors and the
like

=== Character Classification ===

Character classification is used for regexes, e.g. whether a character
is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]"
class. Unicode defines what character properties map into these
classes in TR #18 [1], specifying both a "Standard" variant and a
"POSIX Compatible" variant. The main difference with the POSIX variant
is that symbols count as punctuation.
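
As a rough sketch of how the two TR #18 variants could differ in code
(unicode_category() here is a hypothetical lookup into a generated
General Category table, not something in the attached patches):

    #include <stdbool.h>
    #include <stdint.h>

    /* coarse General Category groups, as a generated table might encode them */
    typedef enum
    {
        PG_U_CAT_LETTER,            /* L* */
        PG_U_CAT_MARK,              /* M* */
        PG_U_CAT_NUMBER,            /* N* */
        PG_U_CAT_PUNCTUATION,       /* P* */
        PG_U_CAT_SYMBOL,            /* S* */
        PG_U_CAT_SEPARATOR,         /* Z* */
        PG_U_CAT_OTHER              /* C* */
    } pg_unicode_cat;

    extern pg_unicode_cat unicode_category(uint32_t cp);   /* table lookup, not shown */

    /*
     * [[:punct:]] membership: the "POSIX Compatible" variant additionally
     * counts symbols (general category S*) as punctuation.
     */
    static bool
    pg_u_ispunct(uint32_t cp, bool posix_compat)
    {
        pg_unicode_cat cat = unicode_category(cp);

        return cat == PG_U_CAT_PUNCTUATION ||
            (posix_compat && cat == PG_U_CAT_SYMBOL);
    }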

Character classification in Unicode does not vary from locale to
locale. The same character is considered to be a member of the same
classes regardless of locale (in other words, there's no
"tailoring"). There is no strong compatibility guarantee around the
classification of characters, but it doesn't seem to change much in
practice (I could collect more data here if it matters).

In glibc, character classification is not affected by the locale as
far as I can tell -- all non-"C" locales behave like "C.UTF-8"
(perhaps other libc implementations or versions or custom locales
behave differently -- corrections welcome). There are some differences
between "C.UTF-8" and what Unicode seems to recommend, and I'm not
entirely sure why those differences exist or whether those differences
are important for anything other than compatibility.

Note: ICU offers character classification based on Unicode standards,
too, but the fact that it's an external dependency makes it a
difficult-to-test black box that is not tied to a PG major
version. Also, we currently don't use the APIs that Unicode
recommends; so in Postgres today, ICU-based character classification
is further from Unicode than glibc character classification.

=== LOWER()/INITCAP()/UPPER() ===

The LOWER() and UPPER() functions are defined in the SQL spec with
surprising detail, relying on specific Unicode General Category
assignments. How to map characters seems to be left (implicitly) up to
Unicode. If the input string is normalized, the output string must be
normalized, too. Weirdly, there's no room in the SQL spec to localize
LOWER()/UPPER() at all to handle issues like [1]. Also, the standard
specifies one example, which is that "ß" becomes "SS" when folded to
upper case. INITCAP() is not in the SQL spec.

In Unicode, lowercasing and uppercasing behavior is a mapping[2], and
also backed by a strong compatibility guarantee that "case pairs" will
always remain case pairs[4]. The mapping may be "simple"
(context-insensitive, locale-insensitive, not adding any code points),
or "full" (may be context-sensitive, locale-sensitive, and one code
point may turn into 1-3 code points).
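
In code, the difference shows up mainly in the function signatures.
A hypothetical sketch (again, not the interface of the attached
patches):

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Simple mapping: one code point in, one out; a code point absent
     * from the table maps to itself.
     */
    extern uint32_t unicode_simple_tolower(uint32_t cp);
    extern uint32_t unicode_simple_toupper(uint32_t cp);

    /*
     * Full mapping: may expand to as many as three code points (e.g.
     * U+00DF "ß" uppercases to "SS"), and a handful of entries depend on
     * surrounding context or on the locale ("lt", "tr", "az").  Returns
     * the number of code points written to dst.
     */
    extern int  unicode_full_tolower(uint32_t cp, uint32_t dst[3],
                                     const char *locale, bool final_sigma);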

Titlecasing (INITCAP() in Postgres) in Unicode is similar to
upper/lowercasing, except that it has the additional complexity of
finding word boundaries, which have a non-trivial definition. To
simplify, we'd either use the Postgres definition (alphanumeric) or
the "word" character class specified in [1]. If someone wants more
sophisticated word segmentation they could use ICU.
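
For example, a titlecasing loop over decoded code points might look
roughly like this, with unicode_is_alnum(), unicode_totitle() and
unicode_tolower() standing in for the generated-table lookups:

    #include <stdbool.h>
    #include <stdint.h>

    extern bool     unicode_is_alnum(uint32_t cp);      /* word-character test */
    extern uint32_t unicode_totitle(uint32_t cp);       /* simple titlecase */
    extern uint32_t unicode_tolower(uint32_t cp);       /* simple lowercase */

    /* INITCAP-style folding, with word boundaries = alphanumeric runs */
    static void
    initcap_codepoints(uint32_t *str, int len)
    {
        bool        in_word = false;

        for (int i = 0; i < len; i++)
        {
            if (unicode_is_alnum(str[i]))
            {
                str[i] = in_word ? unicode_tolower(str[i])
                    : unicode_totitle(str[i]);
                in_word = true;
            }
            else
                in_word = false;
        }
    }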

While "full" case mapping sounds more complex, there are actually very
few cases to consider and they are covered in another (small) data
file. That data file covers ~100 code points that convert to multiple
code points when the case changes (e.g. "ß" -> "SS"), 7 code points
that have context-sensitive mappings, and three locales which have
special conversions ("lt", "tr", and "az") for a few code points.

ICU can do the simple case mapping (u_tolower(), etc.) or full mapping
(u_strToLower(), etc.). I see one difference in ICU that I can't yet
explain for the full titlecase mapping of a singular U+0345.
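
For reference, the two ICU entry points look roughly like this in a
minimal standalone test program (illustrative only; the input and
locale here are arbitrary):

    #include <stdio.h>
    #include <unicode/uchar.h>
    #include <unicode/ustring.h>

    int
    main(void)
    {
        /* simple mapping: one code point, locale-insensitive */
        UChar32     simple = u_tolower(0x0130);     /* U+0130, I with dot above */

        /* full mapping: whole string, locale-sensitive */
        UChar       src[] = {0x0130, 0};            /* UTF-16, NUL-terminated */
        UChar       dst[8];
        UErrorCode  status = U_ZERO_ERROR;
        int32_t     n = u_strToLower(dst, 8, src, -1, "tr", &status);

        printf("simple: U+%04X, full: %d UTF-16 units\n",
               (unsigned) simple, (int) n);
        return U_SUCCESS(status) ? 0 : 1;
    }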

glibc in UTF8 (at least in my tests) just does the simple upper/lower
case mapping, extended with simple mappings for the locales with
special conversions (which I think are exactly the same 3 locales
mentioned above). libc doesn't do titlecase. If the resulting character
isn't representable in the server encoding, I think libc just maps the
character to itself, though I should test this assumption.

=== Encodings ===

It's easiest to implement these rules in UTF8, but possible for any
encoding where we can decode to a Unicode code point.
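
For UTF8 the decode step is already available in the tree
(pg_utf_mblen() and utf8_to_unicode() in pg_wchar.h), so the
per-character loop would look something like this sketch; other server
encodings would need their own decode step:

    #include "postgres.h"
    #include "mb/pg_wchar.h"

    /* walk a UTF-8 string, handing each code point to the ctype tables */
    static void
    classify_utf8(const unsigned char *s, int len)
    {
        while (len > 0)
        {
            int         clen = pg_utf_mblen(s);     /* bytes in this character */
            pg_wchar    cp = utf8_to_unicode(s);    /* Unicode code point */

            /* ... look up cp in the generated ctype tables ... */
            (void) cp;

            s += clen;
            len -= clen;
        }
    }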

=== Patches ===

0001 & 0002 are just cleanup. I intend to commit them unless someone
has a comment.

0003 implements character classification ("Standard" and "POSIX
Compatible" variants) but doesn't actually use them for anything.

0004 implements "simple" case mapping, and a partial implementation of
"full" case mapping. Again, does not use them yet.

=== Questions ===

* Is a built-in ctype provider a reasonable direction for Postgres as
a project?
* Does it feel like it would be simpler or more complex than what
we're doing now?
* Do we want to just try to improve our ICU support instead?
* Do we want the built-in provider to be one thing, or have a few
options (e.g. "standard" or "posix" character classification;
"simple" or "full" case mapping)?

Regards,
Jeff Davis

[1] http://www.unicode.org/reports/tr18/#Compatibility_Properties
[2] https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G33992
[3] https://www.postgresql.org/message-id/flat/9d63548c4d86b0f820e1ff15a83f93ed9ded4543(dot)camel(at)j-davis(dot)com
[4] https://www.unicode.org/policies/stability_policy.html#Case_Pair

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachment Content-Type Size
v2-0004-Add-unicode-case-mapping-tables-and-functions.patch text/x-patch 182.4 KB
v2-0003-Add-Unicode-property-tables.patch text/x-patch 91.5 KB
v2-0002-Shrink-unicode-category-table.patch text/x-patch 101.7 KB
v2-0001-Minor-cleanup-for-unicode-update-build-and-test.patch text/x-patch 7.4 KB
