BUG #14072: char() and application of locale collation

From: emmanuel(dot)reynard(at)eranea(dot)com
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #14072: char() and application of locale collation
Date: 2016-04-07 10:28:20
Message-ID: 20160407102820.15196.86869@wrigleys.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 14072
Logged by: Emmanuel Reynard
Email address: emmanuel(dot)reynard(at)eranea(dot)com
PostgreSQL version: 9.4.7
Operating system: linux
Description:

Hello,

We've been using a custom locale since we needed to use special ordering for
certain values in a UTF-8 database.

For example, E'\xEC\x82\x80' (E'\uC080') is defined as the first in the
list, before all other possible codepoints/characters.

While it works fine for varchars, properly placing this codepoint before a
space, it doesnt work as we expected in the case of chars.

This seems to be explained by the fact that bcharcmp() (in
src/backend/utils/adt/varchar.c) uses the bcTrueLength of said char(), which
does not count the number of trailing spaces.
Hence a blank char(3) with value '\x202020' is of "true length" 0 and
smaller than any other string, and any locale information will be
disregarded.

This behaviour already had surprised us while calling the length() function
on char(x) with blank pads :
test_utf8=# select length(' '::char(3)) as char_len, char_length('
'::char(3)) as char_charlen, octet_length(' '::char(3)) as char_octetlen,
length(' '::varchar(3)) as varchar_len;
char_len | char_charlen | char_octetlen | varchar_len
----------+--------------+---------------+-------------
0 | 0 | 3 | 3
(1 row)

Is this considered normal? Especially the case with char(x) which can ignore
partially ignore collation if they have trailing blanks.

I can imagine that changing this length behaviour might break a few things
though, is there a workaround besides creating a new custom type?

Cheers.

Emmanuel Reynard

- test case :
create table test_collation (c1 char(3), c2 varchar(3));
insert into test_collation values (' ', ' ');
insert into test_collation values (E'\xEC\x82\x80', E'\xEC\x82\x80');
insert into test_collation values (' ' || E'\uC080', ' ' || E'\uC080');

select c2, c2::bytea from test_collation order by c2 collate "fr_FR.utf8";
c2 | c2
-----+------------
| \x20
삀 | \x20ec8280
삀 | \xec8280

select c2, c2::bytea from test_collation order by c2 collate
"locale_eranea";
c2 | c2
-----+------------
삀 | \xec8280
| \x20
삀 | \x20ec8280

select c1, c1::bytea from test_collation order by c1 collate "fr_FR.utf8";
c1 | c1
------+--------------
| \x202020
삀 | \x20ec828020
삀 | \xec82802020

select c1, c1::bytea from test_collation order by c1 collate
"locale_eranea";
c1 | c1
------+--------------
| \x202020
삀 | \xec82802020
삀 | \x20ec828020

Browse pgsql-bugs by date

  From Date Subject
Next Message Tim Singletary 2016-04-07 11:15:47 Re: BUG #14054: "create index using gist ..." on large table never completes.
Previous Message Miklós Fazekas 2016-04-07 08:26:15 released savepoint blocking further statements