Re: Unicode FFFF Special Codepoint should always collate high.

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Telford Tendys <psql(at)lnx-bsp(dot)net>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Unicode FFFF Special Codepoint should always collate high.
Date: 2021-06-22 11:17:15
Message-ID: CA+hUKGKnYyS01bVmY7p6ozBgB0ZA=ReSkyWXG5bbHUt6fh-aig@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Tue, Jun 22, 2021 at 9:39 PM Telford Tendys <psql(at)lnx-bsp(dot)net> wrote:
> The real character codepoints (e.g. 0x20 space, or 0x2f slash) are sorting
> after the non-character codepoint 0xffff, which is supposed to always have
> the highest possible primary weight in all locales, and it is the only
> codepoint available to serve this purpose. The other 4-byte non-character
> codepoints also incorrectly sort lower than real characters.

Not an expert in this subject (and to make things more interesting,
unicode.org has temporarily fallen off the internet, as mentioned in
another thread nearby), but definitely curious... I guess this might
refer to TR35:

U+FFFF: This code point is tailored to have a primary weight higher
than all other characters. This allows the reliable specification of a
range, such as “Sch” ≤ X ≤ “Sch\uFFFF”, to include all strings
starting with "sch" or equivalent.
U+FFFE: This code point produces a CE with minimal, unique weights
on primary and identical levels. For details see the CLDR Collation
Algorithm above.

Considering the squirrelly definition of noncharacters and their
status as special values for internal use (internal to what?) and not
for data interchange, and the specification of that rule with in the
document controlling markup of collation rules (is it also specified
somewhere else?), is this actually required to work the way you expect
when external users of a conforming collation algorithm sort them?
That's not a rhetorical question, I don't know the answer.

In any case, this is not a PostgreSQL issue, surely. We simply call
out to libc (see your manufacturer for more details) or ICU.

Just for interest, I tried your query on a FreeBSD 13 system (its libc
is based on UCA/CLDR too) and the output matched your RHEL8 example
(using COLLATE "en_US.UTF-8"). I tried it on ICU (using COLLATE
"en-x-icu") and it gave a third output, matching neither RHEL7 nor
RHEL8 but instead putting U&'x\FFFF' last, which seems to match what
you expect:

tmunro=> SELECT ENCODE(x::bytea, 'hex')
FROM (
SELECT UNNEST(ARRAY['x', 'x/', U&'x\+02FFFF', U&'x\FFFF',
U&'x\+01FFFF', 'x '])::text AS x
) z
ORDER BY x COLLATE "en-x-icu";
encode
------------
78
7820
782f
78f09fbfbf
78f0afbfbf
78efbfbf
(6 rows)

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Thomas Munro 2021-06-22 13:00:31 Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"
Previous Message Daniel Gustafsson 2021-06-22 11:10:06 Re: BUG #17063: repmgrd_upstream_reconnect getting more