BUG #1859: 3-octet private use UTF8 chars reported as identical

From: "Nathan Culwell-Kanarek" <nculwell(at)wisc(dot)edu>
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #1859: 3-octet private use UTF8 chars reported as identical
Date: 2005-09-01 23:20:49
Message-ID: 20050901232049.3E621F14A7@svr2.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs


The following bug has been logged online:

Bug reference: 1859
Logged by: Nathan Culwell-Kanarek
Email address: nculwell(at)wisc(dot)edu
PostgreSQL version: 8.0.3
Operating system: Fedora Core Linux 4
Description: 3-octet private use UTF8 chars reported as identical
Details:

-------------------------------------
Bug description:
-------------------------------------

Our project is using a specialized Unicode character set that uses some
characters from the private use area, displayed with the Aborginal Serif
font (the font can be freely downloaded at http://www.languagegeek.com/), to
represent the Menominee language. We've run into a problem, which is that
the PostgreSQL backend is interpreting 4 of the private use characters as
being equivalent. A SELECT query using the equality character indicates that
they are equal, and when there is a field with a uniqueness constraint then
it cannot have 2 values that differ only by a minimal pair of the private
use characters.

Note: The private use characters in question are 3 octets long in UTF-8
encoding, which is what we're using.

-------------------------------------
Program to reproduce the bug:
-------------------------------------

#!env perl

# Create a test database.

system "createdb -E unicode bugdb";

# Define constants for the private use Unicode
# characters. These are defined by explicitly
# encoding the UTF-8 octets that compose the
# characters we need. From here on the characters
# are referred to through these variables.

$UTF8_AE = "\xEF\x98\xB6"; # AE
$UTF8_ae = "\xEF\x98\xB7"; # ae
$UTF8_AE_mac = "\xEF\x98\xB8"; # AE-macron
$UTF8_ae_mac = "\xEF\x98\xB9"; # ae-macron

sub exec_sql {
my $sql = shift;
system 'psql -d bugdb -e -c "' . $sql . '"';
}

# Make SELECT queries to compare strings that should be
# distinct; the expected result for each of these
# queries is FALSE.

exec_sql("SELECT '$UTF8_AE' = '$UTF8_AE_mac'");
exec_sql("SELECT '$UTF8_AE' = '$UTF8_ae'");
exec_sql("SELECT '$UTF8_AE' = '$UTF8_ae_mac'");
exec_sql("SELECT '$UTF8_ae' = '$UTF8_AE_mac'");
exec_sql("SELECT '$UTF8_ae' = '$UTF8_ae_mac'");
exec_sql("SELECT '$UTF8_AE_mac' = '$UTF8_ae_mac'");

# Make a table with a VARCHAR field as the primary key
# (thus the field has UNIQUE and NOT NULL constrants).
# Add 2 distinct strings, it should be valid to have
# both values in the field in different records.

exec_sql("CREATE TABLE mytable (word VARCHAR PRIMARY KEY)");
exec_sql("INSERT INTO mytable (word) VALUES ('n${UTF8_ae}q')");
exec_sql("INSERT INTO mytable (word) VALUES ('n${UTF8_ae_mac}q')");

system "dropdb bugdb";

-------------------------------------
The output from the above program:
-------------------------------------

$ perl makebug.pl
CREATE DATABASE
SELECT '' = ''
?column?
----------
t
(1 row)

SELECT '' = ''
?column?
----------
t
(1 row)

SELECT '' = ''
?column?
----------
t
(1 row)

SELECT '' = ''
?column?
----------
t
(1 row)

SELECT '' = ''
?column?
----------
t
(1 row)

SELECT '' = ''
?column?
----------
t
(1 row)

CREATE TABLE mytable (word VARCHAR PRIMARY KEY)
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index
"mytable_pkey" for table "mytable"
CREATE TABLE
INSERT INTO mytable (word) VALUES ('nq')
INSERT 361753 1
INSERT INTO mytable (word) VALUES ('nq')
ERROR: duplicate key violates unique constraint "mytable_pkey"
DROP DATABASE

-------------------------------
Platform information:
-------------------------------

The system is Fedora Core Linux 4 (kernel 2.6.12-1.1398_FC4, glibc 2.3.5-10)
running PostgreSQL 8.0.3 on an Intel processor

(Pentium 4 I believe) with 512MB of RAM.

$ uname -a
Linux ling.wisc.edu 2.6.12-1.1398_FC4 #1 Fri Jul 15 00:52:32 EDT 2005 i686
i686 i386 GNU/Linux
$ rpm -qa | grep postgresql
postgresql-server-8.0.3-1
postgresql-devel-8.0.3-1
postgresql-odbc-08.00.0100-1
postgresql-python-8.0.3-1
postgresql-libs-8.0.3-1
postgresql-tcl-8.0.3-1
postgresql-8.0.3-1
$ postmaster --version
postmaster (PostgreSQL) 8.0.3
[nate(at)ling:~]
$ psql --version
psql (PostgreSQL) 8.0.3
contains support for command-line editing
$ rpm -q glibc
glibc-2.3.5-10

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Martijn de Munnik 2005-09-02 01:04:19 Compile with Solaris 10 AMD 64bit Sun CC
Previous Message D.J. Kniep 2005-09-01 15:15:53 BUG #1858: setting search path in select doesn't (always) work