Re: Crash report for some ICU-52 (debian8) COLLATE and work_mem values

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Daniel Verite <daniel(at)manitou-mail(dot)org>, PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Crash report for some ICU-52 (debian8) COLLATE and work_mem values
Date: 2017-08-18 19:36:16
Message-ID: CAH2-Wzn0idkTAqz5xpSC_AiiyBVaZTKMQfzqsyQPkxh8TSP0yA@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Thu, Aug 17, 2017 at 6:22 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> My argument for doing this is very simple: ICU/CLDR/BCP 47 provides
> stability guarantees for locales, not collations [1]. For example, as
> we discussed, de_BE didn't actually go away -- it just stopped being a
> distinct collation within ICU, for reasons that are implementation
> defined.

I have data to back this up. I attach 2 files: one is a listing of
locale XML files from within CLDR 1.9's ./common/main/, dating from
December 2010, and the other is a similar listing for CLDR 31, dating
from April 2017. This roughly covers every ICU version we'll support
on day 1. The listings are sorted alphabetically, to ease comparison.

Summary:

$ cat locale_list_cldr-19.txt | wc -l
605
$ cat locale_list_cldr-31.txt | wc -l
722
$ diff -d -u locale_list_cldr-19.txt locale_list_cldr-31.txt | grep "^-[a-zA-Z]" | wc -l
144
$ diff -d -u locale_list_cldr-19.txt locale_list_cldr-31.txt | grep "^+[a-zA-Z]" | wc -l
261

So, there have been 144 locales removed in that time, and 261 added.
My proposal to standardize on using all locales ICU makes available,
rather than all behaviorally distinct collations, clearly does not
ensure perfect stability. It does actually work pretty well in
practice, though. The number 144 is misleadingly high. If you actually
look at what went away in detail, a lot of it is script variants of
the same language/country code. Plus, the changes themselves are
non-technical in nature.
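
For anyone who wants to cross-check the counts without parsing diff
output, here is a rough sketch using comm. The sample file names below
are made up for illustration and stand in for the two attached
listings; it assumes one locale file name per line (e.g. "de_BE.xml")
and treats a 4-letter title-case second subtag as a script subtag.

```shell
#!/bin/sh
# Hypothetical sample data standing in for the two attached listings.
tmp=$(mktemp -d)
printf '%s\n' de.xml de_BE.xml ku.xml sr_Cyrl_CS.xml sr_Latn_CS.xml |
  LC_ALL=C sort > "$tmp/old.txt"
printf '%s\n' ckb.xml de.xml de_BE.xml sr_Cyrl_ME.xml sr_Latn_ME.xml |
  LC_ALL=C sort > "$tmp/new.txt"

# comm needs both inputs sorted in the same collation order, hence LC_ALL=C.
# -23 keeps lines only in the old list, -13 lines only in the new list.
removed=$(LC_ALL=C comm -23 "$tmp/old.txt" "$tmp/new.txt" | wc -l | tr -d ' ')
added=$(LC_ALL=C comm -13 "$tmp/old.txt" "$tmp/new.txt" | wc -l | tr -d ' ')

# Of the removed entries, count those carrying a script subtag
# (assumed shape: language, underscore, 4-letter title-case script).
script_variants=$(LC_ALL=C comm -23 "$tmp/old.txt" "$tmp/new.txt" |
  grep -cE '^[a-z]{2,3}_[A-Z][a-z]{3}[_.]')

echo "removed=$removed added=$added script_variants=$script_variants"
rm -rf "$tmp"
```

Pointing the two comm inputs at locale_list_cldr-19.txt and
locale_list_cldr-31.txt instead of the sample files should reproduce
the removed/added totals above, provided both listings are sorted the
same way.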

The churn seems to be in part due to geopolitical changes, such as 5
years [1] passing after the dissolution of Serbia and Montenegro.
However, it is mostly due to switching from ISO 639-1 to ISO 639-3
codes in cases where a finer distinction about cultural preferences
needed to be made (note that they still only list *macro*
language/region/script combinations as distinct collations). For
example, Kurdish went from being "ku-" to 3 different macro languages:
"ckb-" (Central Kurdish), "kmr-" (Northern Kurdish), and "sdh-"
(Southern Kurdish). Wikipedia says of ISO 639-3: "Because it provides
comprehensive language coverage, giving equal opportunity for all
languages, and because of its wide adoption in information
technologies, ISO 639-3 provides an important technology component
addressing the digital divide problem". We can hope that it will be
the last such revision ever needed, because this digital divide
problem is solved once and for all, at least as far as these standards
go.

CLDR prefers to use ISO 639-1 language codes for compatibility [2],
which is why most language codes are still 2 letters. "en" did not
change to "eng", because there was no cultural reason to do so, and
thus there was a 1:1 mapping between "en" and "eng" anyway.
Regions/countries will only change due to rare geopolitical events.
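
A quick way to see how many entries in a listing still use 2-letter
language codes is to tally the length of the leading language subtag.
This is only a sketch: the sample names are hypothetical, and it
assumes the same one-file-name-per-line format as the attached
listings.

```shell
#!/bin/sh
# Hypothetical sample entries; substitute one of the attached listings.
# Split each name on "_" or "." and count by length of the first field,
# which is the language subtag.
tally=$(printf '%s\n' ckb.xml de.xml de_BE.xml en.xml haw.xml sr_Cyrl_ME.xml |
  awk -F'[_.]' '{ n[length($1)]++ }
       END { for (k in n) print k "-letter: " n[k] }' |
  LC_ALL=C sort)
echo "$tally"
```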

In summary, I think that these changes are fairly low impact in
practice, and are entirely explainable by political changes and
cultural controversies. They really are minimal, because CLDR/ICU
really does take the stability of collation names seriously. We can
and should ensure that locales like "de_BE" are available in every ICU
version, because their absence would be an inexcusable technical
oversight, not the result of a cultural or political issue.

[1] http://cldr.unicode.org/index/process/cldr-data-retention-policy
[2] http://www.unicode.org/reports/tr35/#unicode_language_subtag_validity
--
Peter Geoghegan

Attachment Content-Type Size
locale_list_cldr-31.txt text/plain 6.8 KB
locale_list_cldr-19.txt text/plain 5.7 KB
