Re: [SQL] Comparison semantics of CHAR data type

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Thomas Fanghaenel <tfanghaenel(at)salesforce(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: [SQL] Comparison semantics of CHAR data type
Date: 2013-10-11 19:44:37
Message-ID: 20131011194437.GA3614@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-sql

On Sun, Sep 22, 2013 at 08:51:26PM -0400, Thomas Fanghaenel wrote:
> I was wondering about the proper semantics of CHAR comparisons in some corner
> cases that involve control characters with values that are less than 0x20
> (space).
>
> Consider the following testcase:
>
> ===
> create table t (a int, b char(10));
>
> insert into t values (1, 'foo');
> insert into t values (2, 'foo ');
> insert into t values (3, E'foo\t');
> insert into t values (4, E'foo\n');
> insert into t values (5, E'foo \n');
> insert into t values (6, 'foobar');
>
> select * from t order by b;
> ===
>
> What's the proper order of these string values in the CHAR domain? The way I
> interpret the SQL Standard (and assuming that \t and \n collate lower than a
> space), it's supposed to be this:
>
> (3) < (4) < (5) < (1) <= (2) < (6)
>
> Postgres comes up with this:
>
> (1) <= (2) < (3) < (4) < (5) < (6)
>
> The reason is that the bpchar functions that implement the relative comparison
> operators for CHAR(n) effectively strip trailing whitespaces before doing the
> comparison. One might argue that doing this is not correct. The standard
> seems to mandate that all CHAR(n) values are actually considered to be of width
> n, and that trailing spaces are indeed relevant for comparison. In other
> words, stripping them would only be possible if it can be guaranteed that there
> are no characters in the character set that collate lower than a space.
>
> Any thoughts on this? I searched the mailing list archives, but couldn't find
> any relevant discussion. There were plenty of threads that argue whether or
> not it's semantically correct to strip trailing spaces from CHAR(n) values, but
> the issue of characters collating below a space does not seem to have brought
> up in any of those discussions before.

[I am moving this thread to hackers because I think it needs internals
review.]

You have some good questions here, though there are two interrelated
things going on here. First is collation, and the second is the
trimming of spaces from char() comparisons. Let's look at collation
first:

test=> SHOW lc_collate;
lc_collate
-------------
en_US.UTF-8
(1 row)

test=> SELECT 'a c' UNION ALL SELECT 'ab' ORDER BY 1;
?column?
----------
ab
a c
(2 rows)

You will notice spaces are not considered important in a UTF8 collation.
If we do this in the C collation, we get a different result:

test=> CREATE DATABASE test2 WITH LC_COLLATE = 'C' TEMPLATE template0;
CREATE DATABASE
test=> \c test2
You are now connected to database "test2" as user "postgres".
test2=> SELECT 'a c' UNION ALL SELECT 'ab' ORDER BY 1;
?column?
----------
a c
ab
(2 rows)

Also, when using ORDER BY, it isn't clear if the values are ordered that
way due to being greater/less-than, or just randomly. For example, I
found your example above gave different ordering if I inserted the
values differently, using a UTF8 collation.

Let me use comparisons instead using UTF8 for clarity:

test=> SHOW lc_collate;
lc_collate
-------------
en_US.UTF-8
(1 row)

test=> select 'a c' < 'ab';
?column?
----------
f
(1 row)

and "C":

test2=> SHOW lc_collate;
lc_collate
------------
C
(1 row)

test2=> select 'a c' < 'ab';
?column?
----------
t
(1 row)

Now, let's look at ASCII characters less than space, first in UTF8:

test=> SHOW lc_collate;
lc_collate
-------------
en_US.UTF-8
(1 row)

test=> select E'a\nb' < E'a b';
?column?
----------
f
(1 row)

and in C:

test2=> SHOW lc_collate;
lc_collate
------------
C
(1 row)

test2=> select E'a\nb' < E'a b';
?column?
----------
t
(1 row)

You can see that newline is greater than space in UTF8, but not in C.

Now, on to the trailing space issue using the default TEXT value for
strings, first in UTF8:

test=> SHOW lc_collate;
lc_collate
-------------
en_US.UTF-8
(1 row)

test=> select E'ab\n' < E'ab ';
?column?
----------
f
(1 row)

then in "C":

test2=> SHOW lc_collate;
lc_collate
------------
C
(1 row)

test2=> select E'ab\n' < E'ab ';
?column?
----------
t
(1 row)

This matches the \n/space issue we saw above. Now, here is where CHAR()
starts to show the unusual behavior you saw, first in UTF8:

test=> SHOW lc_collate;
lc_collate
-------------
en_US.UTF-8
(1 row)

test=> select E'ab\n'::CHAR(10) < E'ab '::CHAR(10);
?column?
----------
f
(1 row)

then in C:

test2=> SHOW lc_collate;
lc_collate
------------
C
(1 row)

test2=> select E'ab\n'::CHAR(10) < E'ab '::CHAR(10);
?column?
----------
f
(1 row)

You can see the UTF8 case is fine because \n is considered greater than
space, but in the C locale, where \n is less than space, the false
return value shows the problem with internal_bpchar_pattern_compare()
trimming the string and first comparing on lengths. This is exactly the
problem you outline, where space trimming assumes everything is less
than a space.

I am sorry for this long email, but I would be interested to see what
other hackers think about this issue.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2013-10-11 20:03:02 Re: buildfarm failures on smew and anole
Previous Message Robert Haas 2013-10-11 19:33:21 buildfarm failures on smew and anole

Browse pgsql-sql by date

  From Date Subject
Next Message Kevin Grittner 2013-10-11 20:58:50 Re: [SQL] Comparison semantics of CHAR data type
Previous Message luckyjackgao 2013-10-11 07:24:59 Re: PostgreSQL 9.2 - pg_dump out of memory when backuping a database with 300000000 large objects