Perl DBI converts UTF-8 again to UTF-8 before sending it to the server

From: Matthias Apitz <guru(at)unixarea(dot)de>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Perl DBI converts UTF-8 again to UTF-8 before sending it to the server
Date: 2019-10-04 13:25:29
Message-ID: 20191004132529.GA2871@c720-r342378
Lists: pgsql-general


Hello,

We're facing the problem that UTF-8 data to be inserted into a CHAR
column is converted to UTF-8 a second time, as if it were ISO-8859-1.
Here is a small Perl program which can be used for testing:

#!/usr/local/bin/perl

use utf8;

my $PGDB = 'dbi:Pg:dbname=newsisis;host=127.0.0.1';
my $PGDB_USER = 'sisis';
my $SQL_INSERT = 'INSERT INTO dbctest (tstchar25, tstint) VALUES (?, ?)';

use DBI;

my $dbh = DBI->connect($PGDB, $PGDB_USER)
    || die "Couldn't connect to $PGDB as user $PGDB_USER: $DBI::errstr\n";

print "DBI is version $DBI::VERSION, DBD::Pg is version $DBD::Pg::VERSION\n";

$dbh->do("SET client_encoding TO UTF8");

$dbh->{pg_enable_utf8} = 1;

my $sth = $dbh->prepare( $SQL_INSERT )
    || die "Can't prepare insert statement $SQL_INSERT: $DBI::errstr";

my $text = "\xc3\xa4";
print "text: ".$text."\n";

$sth->execute($text, 1) or die $sth->errstr, "\n";

Running this gives the following output:

$ ./utf8.pl
DBI is version 1.642, DBD::Pg is version 3.8.0
text: ä

$ ./utf8.pl | od -tx1
0000000 44 42 49 20 69 73 20 76 65 72 73 69 6f 6e 20 31
0000020 2e 36 34 32 2c 20 44 42 44 3a 3a 50 67 20 69 73
0000040 20 76 65 72 73 69 6f 6e 20 33 2e 38 2e 30 0a 74
0000060 65 78 74 3a 20 c3 a4 0a
                       ^^^^^
(this shows that the variable '$text' contains \xc3\xa4, a UTF-8 'ä'
(a-Umlaut)).

If we now look into the table in hex we see:

$ printf "select tstchar25::bytea from dbctest ;\n" | psql -Usisis -dnewsisis
tstchar25
----------------------------------------------------------
\xc383c2a42020202020202020202020202020202020202020202020
(1 Zeile)

i.e. the 'ä' is converted a second time, just as this command would do:

$ printf 'ä' | iconv -f iso-8859-1 -t utf-8 | od -tx1
0000000 c3 83 c2 a4

and of course it looks broken:

$ printf "select tstchar25 from dbctest ;\n" | psql -Usisis -dnewsisis
tstchar25
---------------------------
ä
(1 Zeile)
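
The double conversion shown above can be reproduced in Perl alone, without a
database. The following is a minimal sketch using the core Encode module; the
helper hex_of is made up for illustration. It is only an assumption that
DBD::Pg performs an equivalent encoding step when pg_enable_utf8 is set. The
sketch merely shows that treating the two bytes c3 a4 as two characters and
encoding them as UTF-8 yields exactly the bytes found in the table:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_of {
    my ($str) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $str;
}

# Same literal as in the test program: two characters, U+00C3 and U+00A4.
# "use utf8" only affects literal non-ASCII characters in the source,
# not \x escapes, so this is not the single character 'ä'.
my $bytes = "\xc3\xa4";

# Encoding those two characters as UTF-8 gives the double-encoded form.
my $double = encode('UTF-8', $bytes);
print "encoded as-is:      ", hex_of($double), "\n";   # c3 83 c2 a4

# Decoding first yields one character (U+00E4), which encodes to c3 a4.
my $char   = decode('UTF-8', $bytes);
my $single = encode('UTF-8', $char);
print "decoded, then sent: ", hex_of($single), "\n";   # c3 a4
```

If that assumption holds, passing a decoded character string (or the literal
"\x{e4}") to execute() would be one thing to try; this has not been verified
against the setup described here.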

I watched the traffic between the client ./utf8.pl and the server with
strace, and the data is already broken when it is sent to the server:

...
write(1, "text: \303\244\n", 9) = 9
sendto(3, "P\0\0\0G\0INSERT INTO dbctest (tstchar25, tstint) VALUES ($1, $2)\0\0\2\0\0\0\0\0\0\0\0B\0\0\0\33\0\0\0\0\0\2\0\0\0\4\303\203\302\244\0\0\0\0011\0\1\0\0D\0\0\0\6P\0E\0\0\0\t\0\0\0\0\0S\0\0\0\4", 122, MSG_NOSIGNAL, NULL, 0) = 122
...

Note the sequence '\303\203\302\244' (octal values) in the bind parameter.
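
To compare the strace output with the bytea dump from psql, the octal escapes
can be converted to hex. A small sketch; the variable names are only
illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The four octal escapes from the sendto() line in the strace output.
my @octal = qw(303 203 302 244);

# oct() interprets each string as an octal number; print each as two hex digits.
my $hex = join ' ', map { sprintf '%02x', oct($_) } @octal;
print "$hex\n";   # c3 83 c2 a4
```

These are the same four bytes that the table shows as \xc383c2a4.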

What is the problem with DBI? Thanks

matthias

--
Matthias Apitz, ✉ guru(at)unixarea(dot)de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub

October 3! Congratulations! The Berlin TV Tower turns 50
from: https://www.jungewelt.de/2019/10-02/index.php
