From: | "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org> |
---|---|
To: | Jeff Davis <pgsql(at)j-davis(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Regina Obe <lr(at)pcorp(dot)us>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Sandro Santilli <strk(at)kbt(dot)io>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Order changes in PG16 since ICU introduction |
Date: | 2023-05-16 19:35:28 |
Message-ID: | 25787ec7-4c04-9a8a-d241-4dc9be0b1ba3@postgresql.org |
Lists: | pgsql-hackers |
On 5/5/23 8:25 PM, Jeff Davis wrote:
> On Fri, 2023-04-21 at 20:12 -0400, Robert Haas wrote:
>> On Fri, Apr 21, 2023 at 5:56 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>>> Most of the complaints seem to be complaints about v15 as well, and
>>> while those complaints may be a reason to not make ICU the default,
>>> they are also an argument that we should continue to learn and try
>>> to
>>> fix those issues because they exist in an already-released version.
>>> Leaving it the default for now will help us fix those issues rather
>>> than hide them.
>>>
>>> It's still early, so we have plenty of time to revert the initdb
>>> default if we need to.
>>
>> That's fair enough, but I really think it's important that some
>> energy
>> get invested in providing adequate documentation for this stuff. Just
>> patching the code is not enough.
>
> Attached a significant documentation patch.
> I tried to make it comprehensive without trying to be exhaustive, and I
> separated the explanation of language tags from what collation settings
> you can include in a language tag, so hopefully that's more clear.
>
> I added quite a few examples spread throughout the various sections,
> and I preserved the existing examples at the end. I also left all of
> the external links at the bottom for those interested enough to go
> beyond what's there.
[Personal hat, not RMT]
Thanks -- this is super helpful. A bunch of these examples I had
previously had to figure out by randomly searching blog posts /
trial-and-error, so I think this will help developers get started more
quickly.
Comments (and a lot are just little nits to tighten the language)
Commit message -- typo: "documentaiton"
+ If you see such a message, ensure that the
<symbol>PROVIDER</symbol> and
+ <symbol>LOCALE</symbol> are as you expect, and consider specifying
+ directly as the canonical language tag instead of relying on the
+ transformation.
+ </para>
I'd recommend making this more prescriptive:
"If you see this notice, ensure that the <symbol>PROVIDER</symbol> and
<symbol>LOCALE</symbol> are the expected result. For consistent results
when using the ICU provider, specify the canonical <link
linkend="icu-language-tag">language tag</link> instead of relying on the
transformation."
+ If there is some problem interpreting the locale name, or if it
represents
+ a language or region that ICU does not recognize, a message will
be reported:
This is passive voice; consider:
"If there is a problem interpreting the locale name, or if the locale
name represents a language or region that ICU does not recognize, you'll
see the following error:"
+ <sect3 id="icu-language-tag">
+ <title>Language Tag</title>
+ <para>
Before jumping in, I'd recommend a quick definition of what a language
tag is, e.g.:
"A language tag, defined in BCP 47, is a standardized identifier used to
identify languages in computer systems" or something similar.
(I did find a database that made it simpler to search for these, which
is one issue I've previously had, but I don't think we'd want to link to it.)
+ To include this additional collation information in a language tag,
+ append <literal>-u</literal>, followed by one or more
My first question was "what's special about '-u'", so maybe we say:
"To include this additional collation information in a language tag,
append <literal>-u</literal>, which indicates there are additional
collation settings, followed by one or more..."
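To make the role of "-u" concrete, here is a minimal Python sketch of how a
tag splits into the plain language/region part and the key/value settings
that follow "-u". The function name split_language_tag is hypothetical
(it is not part of the patch, PostgreSQL, or ICU), and the parsing is a
simplification of the BCP 47 rules: it assumes two-character keys, treats a
key with no value as "true", and ignores attributes and other extensions.

```python
def split_language_tag(tag):
    """Split a tag such as 'de-u-co-phonebk-ks-level2' into
    (base, settings). Simplified illustration only, not ICU's parser."""
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "u" not in lowered:
        # No "-u" extension: the whole tag is the base, with no settings.
        return tag, {}

    idx = lowered.index("u")
    base = "-".join(subtags[:idx])
    settings = {}
    key = None
    for sub in lowered[idx + 1:]:
        if len(sub) == 1:
            # Another singleton starts a different extension; stop here.
            break
        if len(sub) == 2:
            # Two-character subtags are keys; a bare key means "true".
            key = sub
            settings[key] = "true"
        elif key is not None:
            # Longer subtags are the value of the most recent key.
            if settings[key] == "true":
                settings[key] = sub
            else:
                settings[key] += "-" + sub
    return base, settings


if __name__ == "__main__":
    print(split_language_tag("de-u-co-phonebk-ks-level2"))
    print(split_language_tag("en-US-u-kn"))
```

A reader who sees that everything after "-u" is just a list of key/value
pairs is less likely to wonder what is special about it.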
+ ICU locales are specified as a <link
linkend="icu-language-tag">Language
+ Tag</link>, but can also accept most libc-style locale names
(which will
+ be transformed into language tags if possible).
+ </para>
I'd recommend removing the parentheticals:
ICU locales are specified as a BCP 47 <link
linkend="icu-language-tag">Language
Tag</link>, but can also accept most libc-style locale names. If
possible, libc-style locale names are transformed into language tags.
+ <title>ICU Collation Levels</title>
Nothing to add here other than to say I'm extremely appreciative of this
section. Once upon a time I sank a lot of time into trying to figure out how
all of these levels worked.
+ Sensitivity when determining equality, with
+ <literal>level1</literal> the least sensitive and
+ <literal>identic</literal> the most sensitive. See <xref
+ linkend="icu-collation-levels"/> for details.
This discusses equality sensitivity, but I'm not sure if I understand
that term here. The ICU docs seem to call these "strengths"[1], maybe we
use that term to be consistent with upstream?
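For reference, the correspondence between the values in the patch text and
the upstream "strength" names can be written down as a small lookup table.
This is only a sketch: the names KS_TO_ICU_STRENGTH and icu_strength are
hypothetical, and the comments give a rough summary of each level rather
than ICU's full definition.

```python
# Value as written in a language tag -> ICU's name for that strength.
KS_TO_ICU_STRENGTH = {
    "level1": "primary",      # roughly: base letters differ
    "level2": "secondary",    # roughly: accents also matter
    "level3": "tertiary",     # roughly: case also matters
    "level4": "quaternary",   # roughly: punctuation can also matter
    "identic": "identical",   # every difference matters
}


def icu_strength(level):
    """Return the upstream strength name for a level value from the tag."""
    return KS_TO_ICU_STRENGTH[level.lower()]


if __name__ == "__main__":
    for level in KS_TO_ICU_STRENGTH:
        print(level, "->", icu_strength(level))
```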
+ If set to <literal>upper</literal>, upper case sorts before lower
+ case. If set to <literal>lower</literal>, lower case sorts before
+ upper case. If set to <literal>false</literal>, it depends on the
+ locale.
Suggestion to tighten this up:
"If set to <literal>false</literal>, the sort depends on the rules of
the locale."
+ Defaults may depend on locale. The above table is not meant to be
+ complete. See <xref linkend="icu-external-references"/> for additinal
+ options and details.
Typo: additinal => "additional"
> I didn't add additional documentation for ICU rules. There are so many
> options for collations that it's hard for me to think of realistic
> examples to specify the rules directly, unless someone wants to invent
> a new language. Perhaps useful if working with an interesting text file
> format with special treatment for delimiters?
>
> I asked the question about rules here:
>
> https://www.postgresql.org/message-id/e861ac4fdae9f9f5ce2a938a37bcb5e083f0f489.camel%40cybertec.at
>
> and got some limited response about addressing sort complaints. That
> sounds reasonable, but a lot of that can also be handled just by
> specifying the right collation settings. Someone who understands the
> use case better could add some more documentation.
I'm not too sure about this one -- from my experience, users want
predictability in sorts, but there are a variety of ways to get that
experience.
Thanks,
Jonathan
[1] https://unicode-org.github.io/icu/userguide/collation/concepts.html