From: | "Daniel Verite" <daniel(at)manitou-mail(dot)org> |
---|---|
To: | Jeff Davis <pgsql(at)j-davis(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: pg_collation.collversion for C.UTF-8 |
Date: | 2023-06-07 15:08:10 |
Message-ID: | e22d6e2e-9981-4f8c-8351-0c6c9e84b63c@manitou-mail.org |
Lists: | pgsql-hackers |
I wrote:
> Consider matching '\d' in a regexp. With C.UTF-8 (glibc-2.35), we
> only match ASCII characters 0-9, or 10 codepoints. With
> "en-US-u-va-posix-x-icu" we match 660 codepoints comprising all the
> digit characters in all languages, plus a bunch of variants for
> mathematical symbols.
BTW, this is not specifically a C.UTF-8 versus "en-US-u-va-posix-x-icu"
difference.
I think that any glibc-based locale will consider that \d
in a regexp means [0-9], and that any ICU locale
will make \d match a much larger variety of characters.
While moving to ICU by default, we should expect that
differences like that will affect apps in a way that might be
more or less disruptive.
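To make the \d difference concrete, here is a minimal sketch of a query
that tests one ASCII digit and one non-ASCII digit under both providers.
The choice of U+0663 (ARABIC-INDIC DIGIT THREE) and the column aliases are
illustrative assumptions, and the query only runs where both the "C.utf8"
and "en-x-icu" collations exist in a UTF8 database:

```sql
-- Sketch only: compare how \d classifies a character under libc and ICU.
-- U&'\0663' is ARABIC-INDIC DIGIT THREE, picked as an example of a
-- non-ASCII decimal digit; any other such digit would serve as well.
SELECT
  d,
  d ~ '\d' COLLATE "C.utf8"   AS libc_match,  -- COLLATE applies to the pattern
  d ~ '\d' COLLATE "en-x-icu" AS icu_match
FROM (VALUES ('3'), (U&'\0663')) AS digits(d);
```

Going by the behavior described above, the ASCII digit should match in
both columns, and the Arabic-Indic digit should match only in the ICU
column.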
Another known difference is that upper() with ICU does not do a
character-by-character conversion, for instance:
WITH words(w) as (values('muß'),('ﬁnal'))
SELECT
  w,
  length(w),
  upper(w collate "C.utf8") as "upper (libc)",
  length(upper(w collate "C.utf8")),
  upper(w collate "en-x-icu") as "upper (ICU)",
  length(upper(w collate "en-x-icu"))
FROM words;

   w   | length | upper (libc) | length | upper (ICU) | length
-------+--------+--------------+--------+-------------+--------
 muß   |      3 | MUß          |      3 | MUSS        |      4
 ﬁnal  |      4 | ﬁNAL         |      4 | FINAL       |      5
The fact that the resulting string can be longer than the original
might cause problems.
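One way such a problem could show up is with length-limited columns. The
following is a sketch under assumptions: the table name "labels" and its
varchar(3) column are hypothetical, and it relies on the upper() results
shown in the table above.

```sql
-- Hypothetical table, for illustration only.
CREATE TEMP TABLE labels (w varchar(3));

-- The original value has 3 characters and fits.
INSERT INTO labels VALUES ('muß');

-- With libc, upper() yields 'MUß' (3 characters), which still fits.
INSERT INTO labels VALUES (upper('muß' COLLATE "C.utf8"));

-- With ICU, upper() yields 'MUSS' (4 characters), so this insert is
-- expected to be rejected as too long for varchar(3).
INSERT INTO labels VALUES (upper('muß' COLLATE "en-x-icu"));
```

An application that uppercases values before storing them in a column
sized for the original input could start failing after a switch to an ICU
collation.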
In general, we can't abstract from the fact that ICU semantics
are different.
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite