Re: Built-in CTYPE provider

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2024-01-11 19:05:36
Message-ID: 0f8d5290eb4bcab682e5fb8030ef3e24f6ed60f2.camel@j-davis.com
Lists: pgsql-hackers

On Wed, 2024-01-10 at 23:56 +0100, Daniel Verite wrote:
> $ bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata
>  
>   The database cluster will be initialized with this locale
> configuration:
>     default collation provider:  builtin
>     default collation locale:    C.UTF-8
>     LC_COLLATE:  C.UTF-8
>     LC_CTYPE:    C.UTF-8
>     LC_MESSAGES: C.UTF-8
>     LC_MONETARY: C.UTF-8
>     LC_NUMERIC:  C.UTF-8
>     LC_TIME:     C.UTF-8
>   The default database encoding has accordingly been set to "UTF8".
>   The default text search configuration will be set to "english".
>
> This is from an environment where LANG=fr_FR.UTF-8
>
> I would expect all LC_* variables to be fr_FR.UTF-8, and the default
> text search configuration to be "french".

You can get the behavior you want by doing:

initdb --builtin-locale=C.UTF-8 --locale-provider=builtin \
-D/tmp/pgdata

where "--builtin-locale" is analogous to "--icu-locale".

It looks like I forgot to document the new initdb option, which seems
to be the source of the confusion. Sorry, I'll fix that in the next
patch set. (See examples in the initdb tests.)

I think this answers some of your follow-up questions as well.

> A related comment is about naming the builtin locale C.UTF-8, the same
> name as in libc. On one hand this is semantically sound, but on the
> other hand, it's likely to confuse people. What about using completely
> different names, like "pg_unicode" or something else prefixed by "pg_"
> both for the locale name and the collation name (currently
> C.UTF-8/c_utf8)?

I'm flexible on naming, but here are my thoughts:

* A "pg_" prefix makes sense.

* If we named it something like "pg_unicode" someone might expect it to
sort using the root collation.

* The locale name "C.UTF-8" is nice because it implies things about
both the collation and the character behavior. It's also nice because
on at least some platforms, the behavior is almost identical to the
libc locale of the same name.

* UCS_BASIC might be a good name, because it also seems to carry the
right meanings, but that name is already taken.

* We also might want to support variations, such as full case mapping (which
uppercases "ß" to "SS", as the SQL standard requires), or perhaps the
"standard" flavor of regexes (which don't count all symbols as
punctuation). Leaving some room to name those variations would be a
good idea.
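
To make the full-versus-simple case mapping distinction concrete, here is a
small Python sketch. It is illustrative only and is not taken from the patch.
Python's str.upper() applies full case mapping; Python has no built-in simple
mapping, so the helper simple_upper_approx below is a hypothetical
approximation that leaves a character unchanged whenever its full mapping
would expand to more than one character. That matches "ß", but it is not
exact for every character.

```python
def full_upper(text: str) -> str:
    # str.upper() uses Unicode full case mapping, so one character
    # may expand to several ("ß" becomes "SS").
    return text.upper()


def simple_upper_approx(text: str) -> str:
    # Hypothetical approximation of simple (one-to-one) case mapping:
    # keep the original character when the full mapping would expand it.
    # This is an assumption for illustration, not the Unicode simple
    # mapping table, and it differs from that table for some characters.
    out = []
    for ch in text:
        up = ch.upper()
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


print(full_upper("straße"))           # STRASSE
print(simple_upper_approx("straße"))  # STRAßE
```

Note that the full mapping changes the length of the string, which is one
reason a variation like this would need its own name.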

Regards,
Jeff Davis
