Re: Built-in CTYPE provider

From: Jeremy Schneider <schneider(at)ardentperf(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org, "Davis, Jeff" <jefdavj(at)amazon(dot)com>
Subject: Re: Built-in CTYPE provider
Date: 2023-12-20 23:47:51
Message-ID: 8a1ae216-8150-41e2-a98d-09c57e3dc90f@ardentperf.com
Lists: pgsql-hackers

On 12/5/23 3:46 PM, Jeff Davis wrote:
> CTYPE, which handles character classification and upper/lowercasing
> behavior, may be simpler than it first appears. We may be able to get
> a net decrease in complexity by just building in most (or perhaps all)
> of the functionality.
>
> === Character Classification ===
>
> Character classification is used for regexes, e.g. whether a character
> is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]"
> class. Unicode defines what character properties map into these
> classes in TR #18 [1], specifying both a "Standard" variant and a
> "POSIX Compatible" variant. The main difference with the POSIX variant
> is that symbols count as punctuation.
>
> === LOWER()/INITCAP()/UPPER() ===
>
> The LOWER() and UPPER() functions are defined in the SQL spec with
> surprising detail, relying on specific Unicode General Category
> assignments. How to map characters seems to be left (implicitly) up to
> Unicode. If the input string is normalized, the output string must be
> normalized, too. Weirdly, there's no room in the SQL spec to localize
> LOWER()/UPPER() at all to handle issues like [1]. Also, the standard
> specifies one example, which is that "ß" becomes "SS" when folded to
> upper case. INITCAP() is not in the SQL spec.
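
To make the two quoted points concrete, here is a minimal Python sketch. It is only an illustration, not how PostgreSQL implements either behavior. The helper names `is_punct_standard` and `is_punct_posix` are hypothetical, and the POSIX rule is simplified to "punctuation or symbol" by General Category, ignoring the finer exclusions in TR #18.

```python
import unicodedata


def is_punct_standard(ch: str) -> bool:
    # Simplified "Standard" variant: General Category Punctuation (P*) only.
    return unicodedata.category(ch).startswith("P")


def is_punct_posix(ch: str) -> bool:
    # Simplified "POSIX Compatible" variant: symbols (S*) also count as
    # punctuation. Assumption: TR #18's extra exclusions are left out here.
    return unicodedata.category(ch)[0] in ("P", "S")


if __name__ == "__main__":
    for ch in "!+$a":
        print(repr(ch), unicodedata.category(ch),
              is_punct_standard(ch), is_punct_posix(ch))

    # Python's str.upper() applies Unicode full case mapping, so the
    # result can be longer than the input.
    print("straße".upper())
```

In this sketch "+" and "$" are symbols, so only the POSIX-style rule treats them as punctuation, while "!" is punctuation under both rules.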

I'll be honest, even though this is primarily about CTYPE and not
collation, I still need to keep re-reading the initial email slowly to
let it sink in and better understand it... at least for me, it's complex
to reason through. 🙂

I'm trying to make sure I understand clearly what the user impact/change
is that we're talking about: after a little bit of brainstorming and
looking through the PG docs, I'm actually not seeing much more than
these two things you've mentioned here: the set of regexp_* functions PG
provides, and these three generic functions. That alone doesn't seem
highly concerning.

I haven't checked the source code for the regexp_* functions yet, but
are these just passing through to an external library? Are we actually
able to easily change the CTYPE provider for them? If nobody
knows/replies then I'll find some time to look.

One other thing that comes to mind: how does the parser do case folding
for relation names? Is that using OS-provided libc as of today? Or did
we code it to use ICU if that's the DB default? I'm guessing libc, and
global catalogs probably need to be handled in a consistent manner, even
across different encodings.

(Kind of related... did you ever see the demo where I create a user named
'🏃' and then I try to connect to a database with non-Unicode encoding?
💥😜 ...at least it seems to be able to walk the index without decoding
strings to find other users - but the way these global catalogs work
scares me a little bit)

-Jeremy

--
http://about.me/jeremy_schneider
