Quick Links

Built-in case-insensitive collation pg_unicode_ci

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Built-in case-insensitive collation pg_unicode_ci
Date:	2025-09-20 00:21:34
Message-ID:	477013748dc89440dca476dbbea4bec18f1aa77e.camel@j-davis.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

--------
Proposal
--------

New builtin case-insensitive collation PG_UNICODE_CI, where the
ordering semantics are just:

strcmp(CASEFOLD(arg1), CASEFOLD(arg2))

and the character semantics are the same as PG_UNICODE_FAST.

This does not perform the Unicode Collation algorithm (UCA), and it is
not a natural language collation. It does not normalize the inputs,
either. It's just comparing codepoint values after folding.

One interesting case is something like:

SELECT 'straße' = 'STRASSE' COLLATE pg_unicode_ci;

which is true, because 'ß' folds to 'ss' when using Unicode Default
Case Folding. (Note that ICU only makes this equivalence at "level1";
at "level2", the strings would compare as unequal in ICU.)

----------
Motivation
----------

Non-deterministic collations cannot be used by SIMILAR TO, and may
cause problems for ILIKE and regexes. The reason is that pattern
matching often depends on the character-by-character semantics, but ICU
collations aren't constrained enough for these semantics to work. See:

"But the definition there is pretty much impossible to implement for
nondeterministic collations: It basically says, the predicate is true
if the string to be matched is equal, using the applicable collation,
to any of the strings in the set of strings described by the regular
expression."

https://www.postgresql.org/message-id/899e7b5f-b54a-4e1b-9218-bb23534fc2c4%40eisentraut.org

However, PG_UNICODE_CI collation does have character-by-character
semantics which are well-defined for pattern matching.

That takes us a step closer to allowing the database default collation
to be case-insensitive.

Some also might prefer the simplicity of PG_UNICODE_CI compared with
ICU.

ICU is still much more flexible for users who know what they want:
normalization, various levels of sensitivity, tailored behavior, etc.

-------
Details
-------

Patch attached.

Implementation-wise, it folds one codepoint at a time to avoid
unnecessary table lookups. The code must be careful about the case
where the result of CASEFOLD() is a different size for each input, and
perhaps still leading to a match.

Does not implement pattern matching. Would need some discussion to see
how it should be integrated with Peter's work.

It's currently slower than ICU, but I don't think there's any inherent
reason.

-------------
Normalization
-------------

The argument could be made that we should both normalize and casefold
before comparing. ICU does that, or something like that, and we have
the infrastructure to do it. But getting that to be both correct and
fast is non-trivial, so I didn't want to add the complexity. It's also
not clear that we want to do that, or at least not all the time.

One option would be to introduce a normalizing collation PG_UNICODE_NCI
later, if needed. And at the same time we could also introduce
NCASEFOLD() which would have corresponding semantics.

----------
Versioning
----------

Unlike other built-in collations, the order does depend on the version
of Unicode, so the collation is given a version equal to the version of
Unicode. (Other builtin collations have a version of "1".)

That means that indexes, including primary keys, can become
inconsistent after a major version upgrade if the version of Unicode
has changed. The conditions where this can happen are much narrower
than with libc or ICU collations:

(a) The database in the prior version must contain code points
unassigned as of that version; and
(b) Some of those previously-unassigned code points must be assigned
to a Cased character in the newer version.

It's a smaller problem than a libc or ICU upgrade, which can cause
differences in sort order for the same reason (unassigned codepoints
later being assigned) as well as many other reasons.

Regards,
Jeff Davis

Attachment	Content-Type	Size
v1-0001-Introduce-case-insensitive-PG_UNICODE_CI.patch	text/x-patch	18.4 KB

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Jeff Davis	2025-09-20 01:03:46	Re: Improve the performance of Unicode Normalization Forms.
Previous Message	John H	2025-09-19 23:42:56	Re: Introduce XID age based replication slot invalidation