Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marc-Olaf Jaschke <marc-olaf(dot)jaschke(at)s24(dot)com>, Postgres-Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)
Date: 2016-03-23 02:33:49
Message-ID: CAM3SWZSzE13i=9pDseTn9XzE21kQ_qHnb7JOkDNUs3akH=jswQ@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Tue, Mar 22, 2016 at 3:06 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Well, if we implement a compatibility GUC that shuts off our
> dependency on strxfrm(), people can go back to having 9.5 be no more
> broken than 9.4 was. I vote we do that and go home.

I don't have a problem with that idea, but I fear "no more broken than
9.4 was" might be a very low bar for certain systems and collations.
Abbreviated keys may simply have unmasked an existing problem in some
cases.

Consider:

[vagrant(at)localhost ~]$ LC_COLLATE=en_us sort strings.txt <-- correct
x xx
x xx"
xxx
xxx"
[vagrant(at)localhost ~]$ LC_COLLATE=de_DE sort strings.txt <-- wrong
xxx
xxx"
x xx
x xx"
[vagrant(at)localhost ~]$ ./strxfrm-binary de_DE.UTF-8 'xxx' 'x xx'
"xxx" -> 2323230108080801020202 (11 bytes)
"x xx" -> 2323230108080801020202010235 (14 bytes)
strcmp(arg1, arg2) result: -1
strcoll(arg1, arg2) result: 6
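
For anyone who wants to repeat this comparison without the strxfrm-binary
tool, here is a minimal sketch in Python. It is not that tool, only an
analogue: compare_both() and sign() are names made up for illustration,
and it assumes the locale being tested is installed on the system
(setlocale raises locale.Error otherwise). Note that Python's
locale.strxfrm() returns a str rather than raw bytes, so it cannot print
the blob in hex the way the transcript does.

```python
import locale

def sign(n):
    """Collapse a comparison result to -1, 0, or 1."""
    return (n > 0) - (n < 0)

def compare_both(a, b, locale_name="C"):
    """Return (strcoll sign, strxfrm-based sign) for a vs. b.

    Hypothetical helper: the two signs should always agree. A mismatch
    means strcoll() and strxfrm() disagree about the order of a and b.
    """
    old = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        coll = sign(locale.strcoll(a, b))
        # Compare the transformed keys with ordinary string comparison.
        ka, kb = locale.strxfrm(a), locale.strxfrm(b)
        xfrm = (ka > kb) - (ka < kb)
        return coll, xfrm
    finally:
        locale.setlocale(locale.LC_COLLATE, old)  # always restore
```

Calling compare_both("xxx", "x xx", "de_DE.UTF-8") on an affected system
would be expected to return two different signs, matching the transcript
above; on a system where the two functions agree, both signs are equal.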

My concern was not merely "academic" (i.e. it was not limited in scope
to things that cannot corrupt B-Tree indexes). I'm pretty sure that we
need to start thinking of this as a problem with strcoll() itself, one
that strxfrm() does not share, and for more fundamental reasons:
strcoll() says that the first string in the de_DE sorted list ("xxx")
is *greater* than the third string ("x xx"). That's wrong, and not just
because strxfrm() gives an intuitively correct answer; it's wrong
specifically because the transitive law has been broken.
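
A rough way to look for this kind of breakage is to brute-force every
triple of strings and check the transitive law directly. The sketch below
is only an illustration: transitivity_violations() is a hypothetical
helper, and by default it uses locale.strcoll() under whatever LC_COLLATE
the process currently has set.

```python
import itertools
import locale

def transitivity_violations(items, cmp=locale.strcoll):
    """Return triples (a, b, c) where a <= b and b <= c, yet a > c.

    Hypothetical helper. An empty list means cmp behaved transitively
    on these inputs; it proves nothing about other inputs.
    """
    return [
        (a, b, c)
        for a, b, c in itertools.permutations(items, 3)
        if cmp(a, b) <= 0 and cmp(b, c) <= 0 and cmp(a, c) > 0
    ]
```

To test a particular collation, call locale.setlocale(locale.LC_COLLATE,
...) first and then pass in the four strings from strings.txt. The check
is cubic in the number of strings, so it only suits small samples.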

--
Peter Geoghegan
