Re: Order changes in PG16 since ICU introduction

From: "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Regina Obe <lr(at)pcorp(dot)us>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Sandro Santilli <strk(at)kbt(dot)io>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Order changes in PG16 since ICU introduction
Date: 2023-05-16 19:35:28
Message-ID: 25787ec7-4c04-9a8a-d241-4dc9be0b1ba3@postgresql.org
Lists: pgsql-hackers

On 5/5/23 8:25 PM, Jeff Davis wrote:
> On Fri, 2023-04-21 at 20:12 -0400, Robert Haas wrote:
>> On Fri, Apr 21, 2023 at 5:56 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>>> Most of the complaints seem to be complaints about v15 as well, and
>>> while those complaints may be a reason to not make ICU the default,
>>> they are also an argument that we should continue to learn and try to
>>> fix those issues because they exist in an already-released version.
>>> Leaving it the default for now will help us fix those issues rather
>>> than hide them.
>>>
>>> It's still early, so we have plenty of time to revert the initdb
>>> default if we need to.
>>
>> That's fair enough, but I really think it's important that some energy
>> get invested in providing adequate documentation for this stuff. Just
>> patching the code is not enough.
>
> Attached a significant documentation patch.

> I tried to make it comprehensive without trying to be exhaustive, and I
> separated the explanation of language tags from what collation settings
> you can include in a language tag, so hopefully that's more clear.
>
> I added quite a few examples spread throughout the various sections,
> and I preserved the existing examples at the end. I also left all of
> the external links at the bottom for those interested enough to go
> beyond what's there.

[Personal hat, not RMT]

Thanks -- this is super helpful. A bunch of these examples I had
previously had to figure out by randomly searching blog posts /
trial-and-error, so I think this will help developers get started more
quickly.

Comments (a lot of these are just little nits to tighten the language):

Commit message -- typo: "documentaiton"

+ If you see such a message, ensure that the <symbol>PROVIDER</symbol> and
+ <symbol>LOCALE</symbol> are as you expect, and consider specifying
+ directly as the canonical language tag instead of relying on the
+ transformation.
+ </para>

I'd recommend making this more prescriptive:

"If you see this notice, ensure that the <symbol>PROVIDER</symbol> and
<symbol>LOCALE</symbol> are the expected result. For consistent results
when using the ICU provider, specify the canonical <link
linkend="icu-language-tag">language tag</link> instead of relying on the
transformation."

+ If there is some problem interpreting the locale name, or if it represents
+ a language or region that ICU does not recognize, a message will be reported:

This is in the passive voice; consider:

"If there is a problem interpreting the locale name, or if the locale
name represents a language or region that ICU does not recognize, you'll
see the following error:"

+ <sect3 id="icu-language-tag">
+ <title>Language Tag</title>
+ <para>

Before jumping in, I'd recommend a quick definition of what a language
tag is, e.g.:

"A language tag, defined in BCP 47, is a standardized identifier used to
identify languages in computer systems" or something similar.

(I did find a database that made it simpler to search for these, which
is one issue I've previously had, but I don't think we'd want to link to it)

+ To include this additional collation information in a language tag,
+ append <literal>-u</literal>, followed by one or more

My first question was "what's special about '-u'", so maybe we say:

"To include this additional collation information in a language tag,
append <literal>-u</literal>, which indicates there are additional
collation settings, followed by one or more..."
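
To illustrate how the pieces fit together, here is a rough Python sketch
(not part of the patch) that assembles a language tag with a "-u"
extension. The function name build_language_tag is hypothetical; the keys
"ks" and "kf" are the BCP 47 collation keys for strength and case-first.
Sorting the keys is just a choice made here to get a stable result.

```python
def build_language_tag(base, settings):
    """Append a -u extension holding collation settings to a base tag."""
    if not settings:
        # No settings: the base tag is used as-is, without "-u".
        return base
    parts = [base, "u"]
    # Emit key/value pairs in key order so the same settings always
    # produce the same tag (an assumption of this sketch).
    for key in sorted(settings):
        parts.extend([key, settings[key]])
    return "-".join(parts)


print(build_language_tag("de", {"ks": "level2", "kf": "upper"}))
# prints "de-u-kf-upper-ks-level2"
```

The resulting string is what you would pass as the locale when creating
an ICU collation.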

+ ICU locales are specified as a <link linkend="icu-language-tag">Language
+ Tag</link>, but can also accept most libc-style locale names (which will
+ be transformed into language tags if possible).
+ </para>

I'd recommend removing the parentheticals:

ICU locales are specified as a BCP 47 <link
linkend="icu-language-tag">Language
Tag</link>, but can also accept most libc-style locale names. If
possible, libc-style locale names are transformed into language tags.
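
As a rough mental model of that transformation, here is a simplified
Python sketch. It is only an approximation for the common cases and is
not how ICU actually does the conversion: the helper name
libc_to_language_tag is hypothetical, and the sketch simply drops the
codeset and any "@modifier" rather than mapping them.

```python
def libc_to_language_tag(name):
    """Approximate a libc-style locale name as a language tag."""
    # Drop any "@modifier" and the codeset (e.g. ".UTF-8"); ICU itself
    # may map some modifiers, which this sketch does not attempt.
    name = name.split("@", 1)[0].split(".", 1)[0]
    # libc separates language and region with "_", language tags use "-".
    return name.replace("_", "-")


print(libc_to_language_tag("en_US.UTF-8"))  # prints "en-US"
print(libc_to_language_tag("de_DE"))        # prints "de-DE"
```

This is why specifying the language tag directly is more predictable:
there is no conversion step to reason about.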

+ <title>ICU Collation Levels</title>

Nothing to add here other than to say I'm extremely appreciative of this
section. Once upon a time I sank a lot of time trying to figure out how
all of these levels worked.

+ Sensitivity when determining equality, with
+ <literal>level1</literal> the least sensitive and
+ <literal>identic</literal> the most sensitive. See <xref
+ linkend="icu-collation-levels"/> for details.

This discusses equality sensitivity, but I'm not sure I understand
that term here. The ICU docs seem to call these "strengths"[1]; maybe we
should use that term to be consistent with upstream?
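
For what it's worth, this is the toy model I have in my head for the
first three levels, written as a Python sketch. It is an assumption-laden
simplification, not ICU's algorithm: it treats level1 as "base letters
only", level2 as "base letters plus accents", and level3 as "also case",
using only the standard unicodedata module. The names comparison_key and
equal_at are made up for the example.

```python
import unicodedata


def comparison_key(text, level):
    """Toy comparison key for strength levels 1 to 3 (not ICU's rules)."""
    # Decompose so accents become separate combining characters.
    decomposed = unicodedata.normalize("NFD", text)
    if level == 1:
        # Level 1: ignore accents by dropping combining marks.
        decomposed = "".join(
            ch for ch in decomposed if not unicodedata.combining(ch)
        )
    if level <= 2:
        # Levels 1 and 2: ignore case differences.
        decomposed = decomposed.casefold()
    return decomposed


def equal_at(a, b, level):
    """Return True if a and b compare equal at the given toy level."""
    return comparison_key(a, level) == comparison_key(b, level)


print(equal_at("r\u00e9sum\u00e9", "Resume", 1))  # prints True
print(equal_at("r\u00e9sum\u00e9", "resume", 2))  # prints False
print(equal_at("Resume", "resume", 2))            # prints True
print(equal_at("Resume", "resume", 3))            # prints False
```

Each higher level takes more differences into account, which is how I
read "more sensitive" in the patch.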

+ If set to <literal>upper</literal>, upper case sorts before lower
+ case. If set to <literal>lower</literal>, lower case sorts before
+ upper case. If set to <literal>false</literal>, it depends on the
+ locale.

Suggestion to tighten this up:

"If set to <literal>false</literal>, the sort depends on the rules of
the locale."
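
A small Python sketch of the upper/lower behavior, in case it helps
picture it. This is a toy approximation and not ICU's implementation: it
compares strings case-insensitively first and only uses case as a
tie-breaker. The helper name case_first_key is hypothetical.

```python
def case_first_key(text, case_first):
    """Toy sort key: case-insensitive order, then case as a tie-breaker."""
    # With "upper", upper case gets the smaller rank and sorts first;
    # with any other value this sketch puts lower case first.
    upper_rank = 0 if case_first == "upper" else 1
    case_ranks = [
        upper_rank if ch.isupper() else 1 - upper_rank for ch in text
    ]
    return (text.casefold(), case_ranks)


words = ["apple", "Apple", "banana"]
print(sorted(words, key=lambda w: case_first_key(w, "upper")))
# prints ['Apple', 'apple', 'banana']
print(sorted(words, key=lambda w: case_first_key(w, "lower")))
# prints ['apple', 'Apple', 'banana']
```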

+ Defaults may depend on locale. The above table is not meant to be
+ complete. See <xref linkend="icu-external-references"/> for additinal
+ options and details.

Typo: additinal => "additional"

> I didn't add additional documentation for ICU rules. There are so many
> options for collations that it's hard for me to think of realistic
> examples to specify the rules directly, unless someone wants to invent
> a new language. Perhaps useful if working with an interesting text file
> format with special treatment for delimiters?
>
> I asked the question about rules here:
>
> https://www.postgresql.org/message-id/e861ac4fdae9f9f5ce2a938a37bcb5e083f0f489.camel%40cybertec.at
>
> and got some limited response about addressing sort complaints. That
> sounds reasonable, but a lot of that can also be handled just by
> specifying the right collation settings. Someone who understands the
> use case better could add some more documentation.

I'm not too sure about this one -- from my experience, users want
predictability in sorts, but there are a variety of ways to get that
experience.

Thanks,

Jonathan

[1] https://unicode-org.github.io/icu/userguide/collation/concepts.html
