Re: Windows default locale vs initdb

From: Juan José Santamaría Flecha <juanjo(dot)santamaria(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Windows default locale vs initdb
Date: 2022-07-22 11:58:54
Message-ID: CAC+AXB10p+mnJ6wrAEm6jb51+8=BfYzD=w6ftHRbMjMuSFN3kQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 20, 2022 at 1:44 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:

> On Wed, Jul 20, 2022 at 10:27 PM Juan José Santamaría Flecha
> <juanjo(dot)santamaria(at)gmail(dot)com> wrote:
> > Still, WIN1252 is not the wrong answer for what we are asking. Even if
> > you enable UTF-8 support [1], the system will use the current default
> > Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.
>
> I'm still confused about what that means. Suppose we decided to
> insist by adding a ".UTF-8" suffix to the name, as that page says we
> can now that we're on Windows 10+, when building the default locale
> name (see experimental 0002 patch, attached). It initially seemed to
> have the right effect:
>
> The database cluster will be initialized with locale "en-US.UTF-8".
> The default database encoding has accordingly been set to "UTF8".
> The default text search configuration will be set to "english".
>
Let me try to explain this using the "Beta: Use Unicode UTF-8 for
worldwide language support" option [1].

- Currently, on a system with the language settings of "English_United
States" and that option disabled, executing initdb gives:

The database cluster will be initialized with locale "English_United
States.1252".
The default database encoding has accordingly been set to "WIN1252".
The default text search configuration will be set to "english".

And as a test for psql:

SET lc_time='tr_tr.utf8';
SET
SELECT to_char('2000-2-01'::date, 'tmmonth');
ERROR: character with byte sequence 0xc5 0x9f in encoding "UTF8" has no
equivalent in encoding "WIN1252"

We get this error even if the database encoding is UTF8; it is caused by
the tr_tr locales being encoded in WIN1254. We can discuss this in another
thread, and I can propose a patch.
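
As a rough illustration of the byte sequence in that error message, here is
a small Python sketch. It assumes that Python's "cp1252" and "cp1254" codecs
are reasonable stand-ins for the WIN1252 and WIN1254 encodings; it does not
reproduce PostgreSQL's own conversion code, and the helper name encodable()
is made up for this example.

```python
# Byte sequence quoted in the error message.
raw = b"\xc5\x9f"

# Interpreted as UTF-8 it is a single character, U+015F
# (LATIN SMALL LETTER S WITH CEDILLA), as used in the Turkish month name.
ch = raw.decode("utf-8")


def encodable(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


print(f"U+{ord(ch):04X}")
# cp1252 / cp1254 are assumed equivalents of WIN1252 / WIN1254.
print("representable in cp1252:", encodable(ch, "cp1252"))
print("representable in cp1254:", encodable(ch, "cp1254"))
```

The character exists in the Turkish code page but not in the Western
European one, which matches the "has no equivalent" wording of the error.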

- If we enable the UTF-8 support option, the same test gives:

The database cluster will be initialized with locale "English_United
States.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

And for psql:

SET lc_time='tr_tr.utf8';
SET
SELECT to_char('2000-2-01'::date, 'tmmonth');
to_char
---------
şubat
(1 row)

In this case the Windows locales are actually UTF8 encoded.

TL;DR: What I want to show through this example is that the Windows ACP is
not modified by setlocale(); it can only be changed through the Windows
registry, and only in recent releases.

> But then the Turkish i test in contrib/citext/sql/citext_utf8.sql
> failed[1]:
>
> SELECT 'i'::citext = 'İ'::citext AS t;
> t
> ---
> - t
> + f
> (1 row)
>
This is the current state of affairs:

- Windows:

SELECT U&'\0131' latin_small_dotless,U&'\0069' latin_small
,U&'\0049' latin_capital, lower(U&'\0049')
,U&'\0130' latin_capital_dotted, lower(U&'\0130');
 latin_small_dotless | latin_small | latin_capital | lower | latin_capital_dotted | lower
---------------------+-------------+---------------+-------+----------------------+-------
 ı                   | i           | I             | i     | İ                    | İ
- Linux:

SELECT U&'\0131' latin_small_dotless,U&'\0069' latin_small
,U&'\0049' latin_capital, lower(U&'\0049')
,U&'\0130' latin_capital_dotted, lower(U&'\0130');
 latin_small_dotless | latin_small | latin_capital | lower | latin_capital_dotted | lower
---------------------+-------------+---------------+-------+----------------------+-------
 ı                   | i           | I             | i     | İ                    | i
The lower() value of latin_capital_dotted (U+0130) differs between the two
platforms: Windows returns it unchanged, while Linux returns "i".
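
For reference, the four code points in those queries can be inspected outside
the database with a short Python sketch. Note the assumption here: Python's
str.lower() applies Unicode's own case mappings rather than either platform's
C library, so it is a third point of comparison and not a reproduction of the
Windows or Linux results.

```python
import unicodedata

# The four code points used in the queries, in the same order.
code_points = ["\u0131", "\u0069", "\u0049", "\u0130"]

for ch in code_points:
    # str.lower() follows Unicode case mapping, independent of the OS locale.
    lowered = ch.lower()
    lowered_hex = " ".join(f"U+{ord(c):04X}" for c in lowered)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> lower: {lowered_hex}")

# Lowercasing U+0130 in Python yields two code points:
# "i" followed by U+0307 (COMBINING DOT ABOVE).
dotted_lower = "\u0130".lower()
print(len(dotted_lower))
```

So the lowercase form of U+0130 is not something all implementations agree
on, which is consistent with the citext test depending on the platform.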

[1]
https://stackoverflow.com/questions/56419639/what-does-beta-use-unicode-utf-8-for-worldwide-language-support-actually-do

Regards,

Juan José Santamaría Flecha
