Re: String comparison and the SQL standard

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: String comparison and the SQL standard
Date: 2013-01-17 15:35:50
Message-ID: 8761.1358436950@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at> writes:
> While researching a problem with a different database system,
> I came across the following in the SQL standard ISO/IEC 9075-2:2003,
> Section 8.2 (<comparison predicate>), General Rules:

> 3) The comparison of two character strings is determined as follows:
> a) Let CS be the collation as determined by Subclause 9.13,
> "Collation determination", for the declared types of the
> two character strings.
> b) If the length in characters of X is not equal to the length
> in characters of Y, then the shorter string is effectively
> replaced, for the purposes of comparison, with a copy of itself
> that has been extended to the length of the longer string by
> concatenation on the right of one or more pad characters,
> where the pad character is chosen based on CS.
> If CS has the NO PAD characteristic, then the pad character is
> an implementation-dependent character different from
> any character in the character set of X and Y that collates
> less than any string under CS.
> Otherwise, the pad character is a <space>.

The PAD case is specifying the way that CHAR(n) comparison should work.
(We don't expose the PAD/NO PAD distinction in any other way than
CHAR vs VARCHAR/TEXT types.)

AFAICS, the NO PAD case is ignorable BS: they are basically specifying
implementation not semantics there, and in a way that is totally
brain-dead. There isn't necessarily any such character as the one they
blithely posit. Moreover, the whole description seems to assume that
string comparison is single-pass left-to-right, which has little to do
with any modern collation specification. We just rely on strcmp to
decide that shorter strings are "less" than longer ones, which is the
point of this spec AFAICT.

Note that we don't actually do CHAR(n) comparison like that either,
but instead choose to strip trailing spaces before the comparison.

In any case, the most significant word in that whole paragraph is
"effectively", which means you can do it however you want as long
as you get an equivalent comparison result.

> That would effectively mean that 'a'='a ' is TRUE for
> all character string types.

In the PAD case, yes. Else no.

> Of the DBMS I tested, Microsoft SQL Server and MySQL gave me
> that very result, while PostgreSQL and Oracle gave me FALSE.

This probably has more to do with what these systems think the
data type of an undecorated literal is, than with whether they do
trailing-space-insensitive comparison all the time.

regards, tom lane

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Merlin Moncure 2013-01-17 17:35:51 Re: Libpq and multithreading
Previous Message Igor Neyman 2013-01-17 15:12:42 Re: speeding up a join query that utilizes a view