Quick Links

Unicode FFFF Special Codepoint should always collate high.

From:	Telford Tendys <psql(at)lnx-bsp(dot)net>
To:	pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject:	Unicode FFFF Special Codepoint should always collate high.
Date:	2021-06-22 08:39:18
Message-ID:	20210622083918.GA12063@mail
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-bugs

Here is an example.

SELECT ENCODE(x::bytea, 'hex')
FROM (
SELECT UNNEST(ARRAY['x', 'x/', U&'x\+02FFFF', U&'x\FFFF',
U&'x\+01FFFF', 'x '])::text AS x
) z
ORDER BY x COLLATE "en_US.utf8";

Which gives the following:

encode
------------
78
78efbfbf
78f09fbfbf
78f0afbfbf
7820
782f
(6 rows)

The real character codepoints (e.g. 0x20 space, or 0x2f slash) are sorting
after the non-character codepoint 0xffff, which is supposed to always have
the highest possible primary weight in all locales, and it is the only
codepoint available to serve this purpose. The other 4-byte non-character
codepoints also incorrectly sort lower than real characters.

Tested on Centos 8 with the following package.

postgresql12-server-12.7-2PGDG.rhel8.x86_64

The "locale -a" command shows that "en_US.utf8" is in the system,
although, for this purpose, all UTF-8 locales should do the same thing.

Database default locale is also en_US.utf8 and system /etc/locale.conf has
the same LANG="en_US.UTF-8".

Testing the same thing with a simple file and using the "sort" command
from command line prompt gives similar results.

$ od -tx1c /tmp/test.utf8
0000000 78 0a 78 f0 af bf bf 0a 78 f0 9f bf bf 0a 78 2f
x \n x 360 257 277 277 \n x 360 237 277 277 \n x /
0000020 0a 78 ef bf bf 0a 78 20 0a
\n x 357 277 277 \n x \n
0000031

$ LC_ALL=C sort /tmp/test.utf8 | od -tx1c
0000000 78 0a 78 20 0a 78 2f 0a 78 ef bf bf 0a 78 f0 9f
x \n x \n x / \n x 357 277 277 \n x 360 237
0000020 bf bf 0a 78 f0 af bf bf 0a
277 277 \n x 360 257 277 277 \n
0000031

$ LC_ALL=en_US.UTF-8 sort /tmp/test.utf8 | od -tx1c
0000000 78 0a 78 ef bf bf 0a 78 f0 af bf bf 0a 78 f0 9f
x \n x 357 277 277 \n x 360 257 277 277 \n x 360 237
0000020 bf bf 0a 78 20 0a 78 2f 0a
277 277 \n x \n x / \n
0000031

You can see that it once again sorts the non-character codepoint lower
than real characters. For reference, these are the Centos 8 glibc packages.

glibc-2.28-151.el8.x86_64
glibc-langpack-en-2.28-151.el8.x86_64

Tested the same thing on sqlfiddle.

http://sqlfiddle.com/#!15/1841e2/2/0

That is using PostgreSQL 9.3 and the system underneath is not made visible.
It gives the sort order that I would expect, special codepoints are sorting
on the high side of real characters.

I then went back to an old Centos 7 machine and tested the same SQL query.
The result is like this:

encode
------------
78
7820
782f
78efbfbf
78f09fbfbf
78f0afbfbf
(6 rows)

That's what I expect the correct answer to be, also a correct match to the
sqlfiddle test. For reference these are the centos packages where I got it
working properly.

postgresql12-server-12.7-1PGDG.rhel7.x86_64

glibc-common-2.17-260.el7_6.6.x86_64
glibc-2.17-260.el7_6.6.x86_64

Same postgres version, but running on an older system. Perhaps this
would implicate a bug introduced in the system itself during the
transition from Centos 7 to Centos 8. The old glibc did not have separate
langpacks and all locales were installed by default.

I'm willing to take this up with Redhat if that's useful.

Responses

Re: Unicode FFFF Special Codepoint should always collate high. at 2021-06-22 11:17:15 from Thomas Munro

Browse pgsql-bugs by date

	From	Date	Subject
Next Message	Heikki Linnakangas	2021-06-22 09:30:38	Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"
Previous Message	Neil Chen	2021-06-22 07:52:52	Re: BUG #17066: Cache lookup failed when null (unknown) is passed as anycompatiblemultirange