Unicode FFFF Special Codepoint should always collate high.

From: Telford Tendys <psql(at)lnx-bsp(dot)net>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Unicode FFFF Special Codepoint should always collate high.
Date: 2021-06-22 08:39:18
Message-ID: 20210622083918.GA12063@mail
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Here is an example.

SELECT ENCODE(x::bytea, 'hex')
FROM (
SELECT UNNEST(ARRAY['x', 'x/', U&'x\+02FFFF', U&'x\FFFF',
U&'x\+01FFFF', 'x '])::text AS x
) z
ORDER BY x COLLATE "en_US.utf8";

Which gives the following:

encode
------------
78
78efbfbf
78f09fbfbf
78f0afbfbf
7820
782f
(6 rows)

The real character codepoints (e.g. 0x20 space, or 0x2f slash) are sorting
after the non-character codepoint 0xffff, which is supposed to always have
the highest possible primary weight in all locales, and it is the only
codepoint available to serve this purpose. The other 4-byte non-character
codepoints also incorrectly sort lower than real characters.

Tested on Centos 8 with the following package.

postgresql12-server-12.7-2PGDG.rhel8.x86_64

The "locale -a" command shows that "en_US.utf8" is in the system,
although, for this purpose, all UTF-8 locales should do the same thing.

Database default locale is also en_US.utf8 and system /etc/locale.conf has
the same LANG="en_US.UTF-8".

Testing the same thing with a simple file and using the "sort" command
from command line prompt gives similar results.

$ od -tx1c /tmp/test.utf8
0000000 78 0a 78 f0 af bf bf 0a 78 f0 9f bf bf 0a 78 2f
x \n x 360 257 277 277 \n x 360 237 277 277 \n x /
0000020 0a 78 ef bf bf 0a 78 20 0a
\n x 357 277 277 \n x \n
0000031

$ LC_ALL=C sort /tmp/test.utf8 | od -tx1c
0000000 78 0a 78 20 0a 78 2f 0a 78 ef bf bf 0a 78 f0 9f
x \n x \n x / \n x 357 277 277 \n x 360 237
0000020 bf bf 0a 78 f0 af bf bf 0a
277 277 \n x 360 257 277 277 \n
0000031

$ LC_ALL=en_US.UTF-8 sort /tmp/test.utf8 | od -tx1c
0000000 78 0a 78 ef bf bf 0a 78 f0 af bf bf 0a 78 f0 9f
x \n x 357 277 277 \n x 360 257 277 277 \n x 360 237
0000020 bf bf 0a 78 20 0a 78 2f 0a
277 277 \n x \n x / \n
0000031

You can see that it once again sorts the non-character codepoint lower
than real characters. For reference, these are the Centos 8 glibc packages.

glibc-2.28-151.el8.x86_64
glibc-langpack-en-2.28-151.el8.x86_64

Tested the same thing on sqlfiddle.

http://sqlfiddle.com/#!15/1841e2/2/0

That is using PostgreSQL 9.3 and the system underneath is not made visible.
It gives the sort order that I would expect, special codepoints are sorting
on the high side of real characters.

I then went back to an old Centos 7 machine and tested the same SQL query.
The result is like this:

encode
------------
78
7820
782f
78efbfbf
78f09fbfbf
78f0afbfbf
(6 rows)

That's what I expect the correct answer to be, also a correct match to the
sqlfiddle test. For reference these are the centos packages where I got it
working properly.

postgresql12-server-12.7-1PGDG.rhel7.x86_64

glibc-common-2.17-260.el7_6.6.x86_64
glibc-2.17-260.el7_6.6.x86_64

Same postgres version, but running on an older system. Perhaps this
would implicate a bug introduced in the system itself during the
transition from Centos 7 to Centos 8. The old glibc did not have separate
langpacks and all locales were installed by default.

I'm willing to take this up with Redhat if that's useful.

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Heikki Linnakangas 2021-06-22 09:30:38 Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"
Previous Message Neil Chen 2021-06-22 07:52:52 Re: BUG #17066: Cache lookup failed when null (unknown) is passed as anycompatiblemultirange