Re: [SQL] Comparison semantics of CHAR data type

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Thomas Fanghaenel <tfanghaenel(at)salesforce(dot)com>
Cc: Kevin Grittner <kgrittn(at)ymail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [SQL] Comparison semantics of CHAR data type
Date: 2013-10-16 18:17:11
Message-ID: 20131016181711.GC18048@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-sql

> > You can see the UTF8 case is fine because \n is considered greater
> > than space, but in the C locale, where \n is less than space, the
> > false return value shows the problem with
> > internal_bpchar_pattern_compare() trimming the string and first
> > comparing on lengths. This is exactly the problem you outline, where
> > space trimming assumes everything is less than a space.
>
> For collations other than C some of those issues that have to do with
> string comparisons might simply be hidden, depending on how strcoll()
> handles inputs off different lengths: If strcoll() applies implicit
> space padding to the shorter value, there won't be any visible
> difference in ordering between bpchar and varchar values. If strcoll()
> does not apply such space padding, the right-trimming of bpchar values
> causes very similar issues even in a en_US collation.
>
> For example, this seems to be the case on OS X:
>
> select 'ab '::char(10) collate "en_US" < E'ab\n'::char(10)
> collate "en_US";
> ?column?
> ----------
> t
> (1 row)
>
> select 'ab '::char(10) collate "C" < E'ab\n'::char(10) collate "C";
> ?column?
> ----------
> t
> (1 row)
>
> select 'ab '::varchar(10) collate "en_US" <
> E'ab\n'::varchar(10) collate "en_US";
> ?column?
> ----------
> f
> (1 row)

The above query returns true on Linux, so there certainly is a
platform-specific difference there. The others are the same.

> select 'ab '::varchar(10) collate "C" < E'ab\n'::varchar(10)
> collate "C";
> ?column?
> ----------
> f
> (1 row)
>
> So here there's actually not only the same \n/space issue as in the C
> collation (which would go away if the bpchar value weren't trimmed).
> It also shows that there might be slight differences in behavior,
> depending which platform you're running on.
>
> On Fri, Oct 11, 2013 at 4:58 PM, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:
> > What matters in general isn't where the characters fall when comparing
> > individual bytes, but how the strings containing them sort according
> > to the applicable collation. That said, my recollection of the spec
> > is that when two CHAR(n) values are compared, the shorter should be
> > blank-padded before making the comparison.
>
> Not necessarily. The SQL Standard actually ties this to the collation
> sequence that is in use. Without a lot of context, this is from
> Subclause 8.2, "<comparison predicate>", General Rule 3)b):
>
> b) If the length in characters of X is not equal to the length in
> characters of Y, then the shorter string is effectively replaced,
> for the purposes of comparison, with a copy of itself that has been
> extended to the length of the longer string by concatenation on the
> right of one or more pad characters, where the pad character is
> chosen based on CS. If CS has the NO PAD characteristic, then the
> pad character is an implementation-dependent character different
> from any character in the character set of X and Y that collates
> less than any string under CS. Otherwise, the pad character is a
> <space>.
>
> In my opinion, that's just a lot of handwaving, to the extent that in
> practice different vendors interpret this clause differently. It seems
> that SQLServer and DB2 do PAD semantics across the board, whereas Oracle
> has uses NO PAD semantics whenever there's a VARCHAR type involved in
> the comparison.
>
> But all that is actually a whole different can of worms, and slightly
> besides the point of my original question. How to properly compare
> strings with different lentgths has been discussed before, see for
> instance the thread in [1]. My intention was not to get that started
> again. As far as I can see, the consensus seems to be that when using
> the C locale, string comparisons should be done using NO PAD semantics.
> (It sure gives some strange semantics if you have varchars with trailing
> spaces, but it's perfectly legal.)
>
> The point is that my testcase deals with strings of the same length.
> Thus, the above clause doesn't really apply. The standard, to my
> understanding, says that fixed-length character values are padded when
> the row is constructed. And once that happens, those spaces become part
> of the value. It's invalid to strip them, unless done explicitly.

Yes, there are three types of comparisons that are important here:

1. 'a'::CHAR(3) < 'a'::CHAR(3)
2. 'a '::CHAR(3) < E'a\n'::CHAR(3)
3. 'a'::CHAR(3) < 'a'::CHAR(4)

You are saying it is only #3 where we can substitute the special
always-lower pad character, while it appears that Postgres does this in
cases #2 and #3.

> > Since we only have the CHAR(n) type to improve compliance with the SQL
> > specification, and we don't generally encourage its use, I think we
> > should fix any non-compliant behavior. That seems to mean that if you
> > take two CHAR values and compare them, it should give the same result
> > as comparing the same two values as VARCHAR using the same collation
> > with the shorter value padded with spaces.
> >
> > So this is correct:
> >
> > test=# select 'ab'::char(3) collate "en_US" < E'ab\n'::char(3) collate "en_US";
> > ?column?
> > ----------
> > t
> > (1 row)
> >
> > ... because it matches:
> >
> > test=# select 'ab '::varchar(3) collate "en_US" < E'ab\n'::varchar(3) collate "en_US";
> > ?column?
> > ----------
> > t
> > (1 row)
>
> Again, this touches on the same point as Bruce's example above. Right
> now these two queries might produce identical results on Linux, because
> of the way strcoll() behaves. On OS X you get different results:
>
> select 'ab'::char(3) collate "en_US" < E'ab\n'::char(3) collate "en_US";
> ?column?
> ----------
> t
> (1 row)
>
> select 'ab '::varchar(3) collate "en_US" < E'ab\n'::varchar(3)
> collate "en_US";
> ?column?
> ----------
> f
> (1 row)

The above returns true on Linux.

> I have no idea who's right. But doesn't this count as evidence that
> right-trimming pbchars is not even safe with the en_US collation?

Well, one problem in fixing this is that there are existing CHAR()
indexes that embed this behavior, and unless we want to break pg_upgrade
for CHAR() indexes, I am not sure what options we have except to
document this.

Let me also add that we don't propogate the CHAR() length through
any/most/all? function calls and operators, so the length comparison
wouldn't work in all cases anyway, e.g.

SELECT 'a'::CHAR(4) || 'b'::CHAR(4);
?column?
----------
ab
(1 row)

While space truncation is odd, it is predictable.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kohei KaiGai 2013-10-16 18:20:57 Re: Triggers on foreign tables
Previous Message Andres Freund 2013-10-16 18:14:12 Re: removing old ports and architectures

Browse pgsql-sql by date

  From Date Subject
Next Message Gary Stainburn 2013-10-17 11:20:44 Advice - indexing on varchar fields where only last x characters known
Previous Message Craig R. Skinner 2013-10-16 14:41:01 Re: PostgreSQL 9.2 tstzrange null/infinity CONSTRAINT CHECK