Re: Built-in CTYPE provider

From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: "Jeff Davis" <pgsql(at)j-davis(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2024-01-10 22:56:23
Message-ID: 6f3e94c0-f174-4380-9b69-072f8a838881@manitou-mail.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Jeff Davis wrote:

> Attached a more complete version that fixes a few bugs

[v15 patch]

When selecting the builtin provider with initdb, I'm getting the
following setup:

$ bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata

The database cluster will be initialized with this locale configuration:
default collation provider: builtin
default collation locale: C.UTF-8
LC_COLLATE: C.UTF-8
LC_CTYPE: C.UTF-8
LC_MESSAGES: C.UTF-8
LC_MONETARY: C.UTF-8
LC_NUMERIC: C.UTF-8
LC_TIME: C.UTF-8
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

This is from an environment where LANG=fr_FR.UTF-8

I would expect all LC_* variables to be fr_FR.UTF-8, and the default
text search configuration to be "french". It is what happens
when selecting ICU as the provider in the same environment:

$ bin/initdb --icu-locale=en --locale-provider=icu -D/tmp/pgdata

Using language tag "en" for ICU locale "en".
The database cluster will be initialized with this locale configuration:
default collation provider: icu
default collation locale: en
LC_COLLATE: fr_FR.UTF-8
LC_CTYPE: fr_FR.UTF-8
LC_MESSAGES: fr_FR.UTF-8
LC_MONETARY: fr_FR.UTF-8
LC_NUMERIC: fr_FR.UTF-8
LC_TIME: fr_FR.UTF-8
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "french".

The collation setup does not influence the rest of the localization.
The problem AFAIU is that --locale has two distinct
meanings in the v15 patch:
--locale-provider=X --locale=Y means use "X" as the provider
with "Y" as datlocale, and it means use "Y" as the locale for all
localized libc functionalities.

I wonder what would happen if invoking
bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata
on a system where C.UTF-8 does not exist as a libc locale.
Would it fail? (I don't have an OS like this to test ATM, will try later).

A related comment is about naming the builtin locale C.UTF-8, the same
name as in libc. On one hand this is semantically sound, but on the
other hand, it's likely to confuse people. What about using completely
different names, like "pg_unicode" or something else prefixed by "pg_"
both for the locale name and the collation name (currently
C.UTF-8/c_utf8)?

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2024-01-10 23:16:03 Re: initdb --no-locale=C doesn't work as specified when the environment is not C
Previous Message Michael Paquier 2024-01-10 22:55:29 Re: GUCifying MAX_WAL_SEND