pl/perl and utf-8 in sql_ascii databases

From: Christoph Berg <cb(at)df7cb(dot)de>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: pl/perl and utf-8 in sql_ascii databases
Date: 2012-02-09 10:21:16
Message-ID: 20120209102116.GA14429@msgid.df7cb.de
Lists: pgsql-hackers

Hi,

we have a database that is storing strings in various encodings (and
non-encodings, namely the arbitrary byte soup that you might see in
email headers from the internet). For this reason, the database uses
sql_ascii encoding. The columns are text, as most characters are
ascii, so bytea didn't seem the right way to go.

Currently we are on 8.3 and are trying to upgrade to 9.1, but our
plperlu functions are acting up.

Old behavior on 8.3 .. 9.0:

sql_ascii =# create or replace function whitespace(text) returns text
language plperlu as $$ $a = shift; $a =~ s/[\t ]+/ /g; return $a; $$;
CREATE FUNCTION

sql_ascii =# select whitespace (E'\200'); -- 0x80 is not valid utf-8
whitespace
------------

sql_ascii =# select whitespace (E'\200')::bytea;
whitespace
------------
\x80

New behavior on 9.1.2:

sql_ascii =# select whitespace (E'\200');
ERROR: XX000: Malformed UTF-8 character (fatal) at line 1.
CONTEXT: PL/Perl function "whitespace"
LOCATION: plperl_call_perl_func, plperl.c:2037

A crude workaround is:

sql_ascii =# create or replace function whitespace_utf8_off(text)
returns text language plperlu as $$ use Encode; $a = shift;
Encode::_utf8_off($a); $a =~ s/[\t ]+/ /g; return $a; $$;
CREATE FUNCTION

sql_ascii =# select whitespace_utf8_off (E'\200');
whitespace_utf8_off
---------------------
\u0080

sql_ascii =# select whitespace_utf8_off (E'\200')::bytea;
whitespace_utf8_off
---------------------
\xc280

(Note that the workaround is not perfect, as the resulting 0x80..0xff
bytes are still tagged as utf-8.)
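
The \xc280 above is what you get when the byte 0x80 is treated as the
code point U+0080 and then encoded as utf-8, which is presumably what
happens to the untagged string on its way back out of Perl. A minimal
standalone sketch of that re-encoding follows; bytes_to_utf8 is a
hypothetical helper written for illustration and is not part of Perl or
PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (illustration only): re-encode a byte string as
 * UTF-8, treating every input byte as the code point of the same value
 * (U+0000..U+00FF). dst must have room for 2 * len + 1 bytes.
 * Returns the number of bytes written, not counting the trailing NUL.
 */
size_t
bytes_to_utf8(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t i;
    size_t n = 0;

    assert(src != NULL && dst != NULL);

    for (i = 0; i < len; i++)
    {
        if (src[i] < 0x80)
        {
            /* ASCII bytes are copied unchanged */
            dst[n++] = src[i];
        }
        else
        {
            /* 0x80..0xff become a two-byte sequence */
            dst[n++] = (unsigned char) (0xC0 | (src[i] >> 6));   /* 0xC2 or 0xC3 */
            dst[n++] = (unsigned char) (0x80 | (src[i] & 0x3F)); /* continuation */
        }
    }
    dst[n] = '\0';
    return n;
}
```

Feeding it the single byte 0x80 yields the two bytes 0xc2 0x80, matching
the bytea output of whitespace_utf8_off.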

I think the bug is in plperl_helpers.h:

/*
 * Create a new SV from a string assumed to be in the current database's
 * encoding.
 */

static inline SV *
cstr2sv(const char *str)
{
    SV   *sv;
    char *utf8_str = utf_e2u(str);

    sv = newSVpv(utf8_str, 0);
    SvUTF8_on(sv);

    pfree(utf8_str);

    return sv;
}

In sql_ascii databases, utf_e2u does not do any recoding, but then
SvUTF8_on still marks the string as utf-8, while it isn't.
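
To illustrate why the flag matters: a lone 0x80 byte is a utf-8
continuation byte and cannot start a sequence, so once the string is
tagged as utf-8, Perl has every reason to call it malformed. The
standalone sketch below shows that structural rule; looks_like_utf8 is a
hypothetical helper, not PostgreSQL or Perl code, and it is simplified
(it does not reject overlong forms or surrogates). It is not meant as a
patch; a real fix would more likely depend on the database encoding
than on the string contents.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (illustration only): structural UTF-8 check.
 * Each lead byte must be followed by the right number of 10xxxxxx
 * continuation bytes. Simplified: overlong forms and surrogates are
 * not rejected.
 */
bool
looks_like_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL);

    while (i < len)
    {
        size_t need;

        if (s[i] < 0x80)
            need = 0;                   /* ASCII */
        else if ((s[i] & 0xE0) == 0xC0)
            need = 1;                   /* 110xxxxx: two-byte sequence */
        else if ((s[i] & 0xF0) == 0xE0)
            need = 2;                   /* 1110xxxx: three-byte sequence */
        else if ((s[i] & 0xF8) == 0xF0)
            need = 3;                   /* 11110xxx: four-byte sequence */
        else
            return false;               /* stray continuation or invalid lead */

        i++;
        for (; need > 0; need--)
        {
            if (i >= len || (s[i] & 0xC0) != 0x80)
                return false;           /* truncated or bad continuation */
            i++;
        }
    }
    return true;
}
```

With this check, the single byte 0x80 is rejected, while the two bytes
0xc2 0x80 are accepted.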

(Returned values might also need fixing.)

In my view, this is clearly a bug in pl/perl on sql_ascii databases.

Christoph
--
cb(at)df7cb(dot)de | http://www.df7cb.de/
