Re: Windows UTF-8, non-ICU collation trouble

From: Noah Misch <noah(at)leadboat(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Windows UTF-8, non-ICU collation trouble
Date: 2020-01-12 00:23:58
Message-ID: 20200112002358.GA2571218@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Dec 11, 2019 at 01:54:47PM +1300, Thomas Munro wrote:
> On Tue, Dec 10, 2019 at 10:29 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > This does suggest some set of CompareString* parameters is free from the
> > problem. If that's right, we could offer collations based on that. (I'm not
> > sure it would be worth offering; ICU may be enough.)
>
> It would be nice to get to the bottom of that (for example, what is
> the relationship between names like "Korean_XXX" and names like
> "ko-KR"?), but I'm unlikely to investigate further (I have enough
> trouble getting N kinds of Unix to do what I want). Generally I like
> the idea of continuing to support and recommend both operating system
> and ICU locales for different use cases. It should be easy to get all
> the software on your system to agree on ordering, which seems like a
> thing you should want as an application designer. The lack of
> versioning is not a problem on Windows (see
> https://commitfest.postgresql.org/26/2351/).

I explored this further in a Windows-specific forum:
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/7682927d-d0f3-4278-a0cc-3fefc2ca4b65

Key points relevant to PostgreSQL:
- SQL Server contains its own snapshot of string collation data, so its
sorting is independent of Windows APIs on the machine hosting SQL Server.
The source of the snapshot is long-past Windows API behavior.
- Passing older NLS version values to Windows string comparison functions does
change collation of these particular test strings. On my Windows Server
2016 system, two older NLS versions were available, both of which exhibited
transitivity for these strings.

Looking at my original proposal in that light:

On Thu, Dec 05, 2019 at 10:34:01PM -0800, Noah Misch wrote:
> All I can think to do is issue a warning whenever a CREATE DATABASE or CREATE
> COLLATION combines UTF8 encoding with a locale having this problem. In a
> greenfield, I would forbid affected combinations of encoding and locale. That
> is too harsh, considering the few code points affected and the difficulty of
> changing the collation of existing databases.

That's still the best I can think to do.

> For CREATE DATABASE, all except
> LOCALE=C would trigger the warning. For CREATE COLLATION, ICU locales would
> also not trigger the warning. Hence, the chief workaround is to use LOCALE=C at
> the database level and ICU collations for indexes and operator invocations.

While that is true for Windows Server 2016, it looks like Windows 7 and
Windows Server 2003 wouldn't trigger the warning. (That's of rapidly decaying
interest, of course.) If someone took your "Collation versions on Windows"
patch a step further and allowed "CREATE COLLATION x (LOCALE = 'de-DE',
VERSION = 'a.b,c.d')", one could specify the use of Windows 7 collation order
on Windows Server 2016, removing the problem (and stopping the warning).

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2020-01-12 01:14:01 Re: [Proposal] Global temporary tables
Previous Message Alexander Korotkov 2020-01-11 22:13:29 Re: [PATCH] Atomic pgrename on Windows