Multibyte problem with COPY FROM [Fwd: Re: postgres 7.2 and unicode]

From: Oliver Elphick <olly(at)lfix(dot)co(dot)uk>
To: pgsql-general(at)postgresql(dot)org
Cc: Craig Sanders <cas(at)taz(dot)net(dot)au>
Subject: Multibyte problem with COPY FROM [Fwd: Re: postgres 7.2 and unicode]
Date: 2002-03-28 14:12:57
Message-ID: 1017324778.1228.389.camel@linda
Lists: pgsql-general

I have confirmed that this problem occurs for me as well. When I try to
import the 7.1 pg_dump data in the attachment, I get:

$ psql junk </tmp/linksdb
DROP DATABASE
CREATE DATABASE
You are now connected to database comanagers.
CREATE
CREATE
ERROR: copy: line 1, Unicode >= 0x10000 is not supoorted
lost synchronization with server, resetting connection

The line where the error occurs includes this character sequence
(according to od -xc, with words reversed into string order):

28 45 73 70 61 F1 61 29
 (  E  s  p  a  ñ  a  )

(I think that F1 is supposed to be n~ in the middle of Espan~a.)

I guess that F1 is being taken as the first byte of a multibyte
character: in UTF-8 a lead byte in the range F0-F7 starts a four-byte
sequence, and those encode code points of 0x10000 and above. So for some
reason the Unicode dumped by 7.1 is not the same as the Unicode expected
by 7.2.
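
The guess about how F1 is read can be checked with a short sketch. The
Python below is an illustration only and not PostgreSQL's own code; the
names utf8_sequence_length and is_valid_utf8 are hypothetical. It looks
at the bit pattern of a UTF-8 lead byte, then shows that the eight bytes
quoted from the failing line are not valid UTF-8 but read as "(España)"
if one assumes they are ISO-8859-1 (Latin-1).

```python
# Bytes quoted from the failing COPY line: 28 45 73 70 61 F1 61 29
SAMPLE = bytes([0x28, 0x45, 0x73, 0x70, 0x61, 0xF1, 0x61, 0x29])


def utf8_sequence_length(lead: int) -> int:
    """Return the sequence length a UTF-8 lead byte announces (0 if none).

    Only the bit pattern is checked; strict UTF-8 additionally rejects
    the lead bytes C0, C1 and F5-F7.
    """
    if lead < 0x80:
        return 1          # plain ASCII
    if 0xC0 <= lead <= 0xDF:
        return 2          # code points up to U+07FF
    if 0xE0 <= lead <= 0xEF:
        return 3          # code points up to U+FFFF
    if 0xF0 <= lead <= 0xF7:
        return 4          # code points U+10000 and above
    return 0              # continuation byte (80-BF) or not a lead byte


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(utf8_sequence_length(0xF1))   # length announced by the F1 byte
    print(is_valid_utf8(SAMPLE))        # strict UTF-8 check of the sample
    print(SAMPLE.decode("latin-1"))     # same bytes read as ISO-8859-1
```

Under these rules F1 announces a four-byte sequence, which fits an error
about code points at or above 0x10000, while the bytes that follow it
(61 29) are ordinary ASCII rather than continuation bytes.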

Can anyone offer a solution, please?

PostgreSQL has been configured thus:

$ /usr/lib/postgresql/bin/pg_config --configure
--with-template=linux --prefix=/usr/lib/postgresql
--enable-unicode-conversion --with-includes=/usr/include/tcl8.3
--includedir=/usr/include/postgresql --with-python --with-openssl
--with-gnu-ld --disable-rpath --enable-odbc --with-unixodbc
--with-CXX --enable-recode --with-tcl --with-perl --with-pam
--enable-multibyte --enable-debug --enable-syslog --enable-locale
--with-tclconfig=/usr/lib/tcl8.3 --with-tkconfig=/usr/lib/tk8.3
--with-maxbackends=64 --with-pgport=5432

-----Forwarded Message-----

From: Craig Sanders <cas(at)taz(dot)net(dot)au>
To: Oliver Elphick <olly(at)lfix(dot)co(dot)uk>
Subject: Re: postgres 7.2 and unicode
Date: 28 Mar 2002 22:50:51 +1100

On Thu, Mar 28, 2002 at 10:13:35AM +0000, Oliver Elphick wrote:
> I haven't heard of such a problem

i've been searching the web and list archives since i discovered this.
haven't seen anything even remotely related to it.

> Could you extract the data properly before the upgrade? Perhaps the
> pg_dump format is wrong?

yes, the data was dumped properly. there's no problem dumping the data.
the problem occurs when trying to read it back in with COPY (as is done
by the postgres package upgrade scripts).

> Can you (in the new database) insert and extract data through the CGI
> forms as you did before?

i believe so, but i haven't confirmed this for myself yet (i didn't
write the database or the CGI scripts, i just look after the server it's
on).

> Please send me an extract from the dump, showing the creation of the
> database and the table, and some of the dud lines

i have attached a file called linksdb containing the sql code to create
a database, sequence and table, and some sample records. these were
extracted from the db.out file created by the upgrade procedure.

there were several hundred records in the linksdb table, but i've
extracted only the ones with characters between 0xe1 and 0xfa. AFAIK,
not all of them cause a problem. some do. the first line (containing
"España") definitely causes the COPY command to die with "copy: line
190, Unicode >= 0x10000 is not supoorted"
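
the extraction described above (keeping only records that contain a byte
between 0xe1 and 0xfa) could be sketched roughly as follows. this is not
the procedure that was actually used; the function name
lines_with_high_bytes is hypothetical, and the sample data in the usage
note is made up for illustration.

```python
import re

# Any single byte from 0xE1 to 0xFA inclusive, the range mentioned above.
HIGH_BYTE = re.compile(rb"[\xe1-\xfa]")


def lines_with_high_bytes(data: bytes) -> list[tuple[int, bytes]]:
    """Return (1-based line number, line) pairs containing a byte in 0xE1-0xFA."""
    return [
        (number, line)
        for number, line in enumerate(data.splitlines(), start=1)
        if HIGH_BYTE.search(line)
    ]


if __name__ == "__main__":
    # Made-up sample: one pure-ASCII line and two lines with high bytes.
    sample = b"plain ascii\n(Espa\xf1a)\ncaf\xe9\n"
    for number, line in lines_with_high_bytes(sample):
        print(number, line)
```

the dump is read as raw bytes on purpose, so that no encoding has to be
assumed before the suspect lines have been found.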

i forced an import by writing a little perl script which used s/// and
tr/// to translate away the bad characters - but that is no
solution...one of the databases is specifically for a web site promoting
multi-lingual web sites for ethnic community groups, so unicode is
essential.
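
as an alternative to translating the characters away, the dump could be
re-encoded so that they survive the import. the sketch below rests on an
assumption that has not been confirmed here: that the bytes which are not
valid UTF-8 are really ISO-8859-1 (Latin-1). the function name
reencode_line is hypothetical.

```python
def reencode_line(line: bytes, assumed_encoding: str = "latin-1") -> bytes:
    """Return the line as UTF-8.

    Lines that already decode as strict UTF-8 are returned unchanged.
    Anything else is ASSUMED to be in `assumed_encoding` (Latin-1 by
    default) and is converted; that assumption may be wrong for real data.
    """
    try:
        line.decode("utf-8")
        return line
    except UnicodeDecodeError:
        return line.decode(assumed_encoding).encode("utf-8")


if __name__ == "__main__":
    # Bytes of the sample that breaks COPY: "(Espa" + F1 + "a)".
    print(reencode_line(b"(Espa\xf1a)"))
```

one caveat: a Latin-1 line whose high bytes happen to form a valid UTF-8
sequence would be passed through unchanged, so the result should be
checked by eye before loading it.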

in case it is of use, i have also attached linksdb.orig which is the
complete contents of the linksdb table.

craig

--
craig sanders <cas(at)taz(dot)net(dot)au>

Fabricati Diem, PVNC.
-- motto of the Ankh-Morpork City Watch

Attachment Content-Type Size
linksdb text/plain 4.9 KB