From: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
---|---|
To: | Oleg Tselebrovskiy <o(dot)tselebrovskiy(at)postgrespro(dot)ru> |
Cc: | pgsql-docs(at)lists(dot)postgresql(dot)org |
Subject: | Re: Initcap works differently with different locale providers |
Date: | 2025-07-28 10:20:06 |
Message-ID: | 0658C8F0-5ED4-4962-A2A3-524B0D899982@gmail.com |
Lists: | pgsql-docs |
Hi, Oleg!
> On 25 Sep 2024, at 18:13, Oleg Tselebrovskiy <o(dot)tselebrovskiy(at)postgrespro(dot)ru> wrote:
>
> Greetings, everyone!
>
> One of our clients has found a difference in behaviour of initcap function when
> using different locale providers, shown below
>
> postgres=# create database test_db_1 locale_provider=icu locale="ru_RU.UTF-8" template=template0;
> NOTICE: using standard form "ru-RU" for ICU locale "ru_RU.UTF-8"
> CREATE DATABASE
> postgres=# \c test_db_1;
> You are now connected to database "test_db_1" as user "postgres".
> test_db_1=# select initcap('ЧиЮ А.Ю.');
> initcap
> ----------
> Чию А.ю.
> (1 row)
> test_db_1=# select initcap('joHn d.e.');
> initcap
> -----------
> John D.e.
> (1 row)
> postgres=# create database test_db_2 locale_provider=libc locale="ru_RU.UTF-8" template=template0;
> CREATE DATABASE
> postgres=# \c test_db_2
> You are now connected to database "test_db_2" as user "postgres".
> test_db_2=# select initcap('ЧиЮ А.Ю.');
> initcap
> ----------
> Чию А.Ю.
> (1 row)
> test_db_2=# select initcap('joHn d.e.');
> initcap
> -----------
> John D.E.
> (1 row)
>
> And an easier reproduction (should work for REL_12_STABLE and up)
>
> postgres=# SELECT initcap('first.second' COLLATE "en-x-icu");
> initcap
> --------------
> First.second
> (1 row)
> postgres=# SELECT initcap('first.second' COLLATE "en_US");
> initcap
> --------------
> First.Second
> (1 row)
>
> This behaviour is reproducible on REL_12_STABLE and up to master
>
> I don't believe that this is an erroneous behaviour, just a differing one, hence
> just a documentation change proposition
>
> I suggest adding a clarification that this function works differently with libc
> and ICU providers because there is a difference in what a "word" is between them
>
> In libc a word is a sequence of alphanumeric characters, separated by
> non-alphanumeric characters (as it is written in documentation right now)
> In ICU words are divided according to Unicode® Standard Annex #29 [1]
>
> Similar issue was briefly discussed in [2]
>
> The suggested documentation patch is attached (versions for REL_13_STABLE+ and
> for REL_12_STABLE only)
>
> [1]: https://www.unicode.org/reports/tr29/#Word_Boundaries
> [2]: https://www.postgresql.org/message-id/CAEwbS1R8pwhRkwRo3XsPt24ErBNtFWuReAZhVPJwA3oqo148tA%40mail.gmail.com
>
> Oleg Tselebrovskiy, Postgres Professional
> <v1-0001-string-functions.patch> <v1-0002-string-functions-REL_12.patch>
I can confirm that initcap behaves with the libc and ICU providers as you described. The documentation patch looks good to me, and I've written a commit message for it. The REL_12_STABLE branch is no longer relevant because it is out of support. I'm going to push this if there are no objections.
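For readers of the thread, the libc-side rule quoted above (a word is a run of alphanumeric characters, delimited by non-alphanumeric ones) can be sketched roughly as below. This is only an illustration in Python, not PostgreSQL's implementation; the function name initcap_alnum is hypothetical, and Python's str.isalnum() stands in for the locale's character classification.

```python
def initcap_alnum(text: str) -> str:
    """Illustrative sketch: uppercase the first character of each run of
    alphanumeric characters and lowercase the rest of the run."""
    out = []
    in_word = False  # True while we are inside an alphanumeric run
    for ch in text:
        if ch.isalnum():
            # First character of a run is uppercased, the others lowercased.
            out.append(ch.lower() if in_word else ch.upper())
            in_word = True
        else:
            # Any non-alphanumeric character ends the current word.
            out.append(ch)
            in_word = False
    return "".join(out)


print(initcap_alnum("first.second"))  # First.Second
print(initcap_alnum("joHn d.e."))     # John D.E.
```

With this rule every "." starts a new word, which matches the libc output in the report. Under the UAX #29 word-boundary rules used by ICU, a full stop between two letters does not end the word, so "first.second" is treated as a single word and only its first letter is capitalized.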
------
Regards,
Alexander Korotkov
Supabase
