plperlu problem with utf8 [REVIEW]

From: Andy Colson <andy(at)squeakycode(dot)net>
To: Alex Hunsaker <badalex(at)gmail(dot)com>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: plperlu problem with utf8 [REVIEW]
Date: 2011-01-15 21:20:38
Message-ID: 4D320FA6.3000005@squeakycode.net
Lists: pgsql-hackers


This is a review of "plperl encoding issues"

https://commitfest.postgresql.org/action/patch_view?id=452

Purpose:
========
Your database uses one encoding and passes data to perl in that same encoding, which perl is not prepared for (it assumes UTF-8). This patch makes sure data is converted to UTF-8 before it's passed to plperl, then converts the response from UTF-8 back to the database encoding for storage.

My test:

ptest2=# create database ptest2 encoding 'EUC_JP' template template0;

I created a simple perl function that reverses the string. I don't know Japanese so I found a tattoo website that had sayings in Japanese... I picked: "I am awesome".


create or replace function preverse(x text) returns text as $$
my $tmp = reverse($_[0]);
return $tmp;
$$ LANGUAGE plperl;

Before the patch:

ptest2=# select preverse('私はよだれを垂らす');

preverse
--------------------
垢蕕眇鬚譴世茲呂篁
(1 row)

It is also possible to generate invalid characters. This function pulls off the last character in the string... assuming it's UTF-8:

create or replace function plastchar(x text) returns text as $$
my $tmp = substr($_[0], -1);
return $tmp;
$$ LANGUAGE plperl;

ptest2=# select plastchar('私はよだれを垂らす');

ERROR: invalid byte sequence for encoding "EUC_JP": 0xb9
CONTEXT: PL/Perl function "plastchar"

Because the string was not UTF-8, perl got confused and returned an invalid character.
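
The same failure can be reproduced outside the database. This is only a sketch in plain Perl, assuming the core Encode module; the script and its variable names are mine and have nothing to do with the patch itself. It shows that the trailing byte of the EUC_JP string is the 0xb9 from the error above:

```perl
use strict;
use warnings;
use utf8;                              # the literal below is written in UTF-8
use Encode qw(encode);

# The bytes an EUC_JP database hands to an unpatched plperl function.
my $bytes = encode('EUC-JP', '私はよだれを垂らす');

# With no character semantics Perl counts bytes: 9 characters become 18.
printf "length: %d\n", length $bytes;

# substr(..., -1) returns only the trailing byte of the last character.
my $last = substr($bytes, -1);
printf "last: 0x%02x\n", ord $last;

# reverse() swaps bytes rather than characters, breaking every two-byte pair.
my $reversed = reverse $bytes;
printf "reversed starts with: 0x%02x\n", ord $reversed;
```

A lone 0xb9 is only the second half of a two-byte EUC_JP character, which is why the database rejects it.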

After the patch:
The exact same plperl functions work fine:

ptest2=# select preverse('私はよだれを垂らす');

preverse
--------------------
すら垂をれだよは私
(1 row)

ptest2=# select plastchar('私はよだれを垂らす');

plastchar
-----------

(1 row)

Performance:
============
This is a bug fix, not a performance patch. However, as noted by the author, many encodings will be very UTF-8'ish and the overhead will be very small. For those encodings that do need to be converted, you'd have to do the same conversion inside your perl function anyway before you could use the data. The processing has just moved from inside your perl func to inside PG.
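
To illustrate, this is roughly the conversion a function author would otherwise have to write by hand. It is a sketch in plain Perl, assuming the core Encode module and an EUC_JP database; preverse_manual is a hypothetical name, not something from the patch:

```perl
use strict;
use warnings;
use utf8;                              # the literal below is written in UTF-8
use Encode qw(encode decode);

# Hypothetical hand-written equivalent of preverse for an EUC_JP database.
sub preverse_manual {
    my ($raw) = @_;
    my $chars = decode('EUC-JP', $raw);   # bytes -> Perl characters
    my $out   = reverse $chars;           # scalar context: reverses characters
    return encode('EUC-JP', $out);        # characters -> database encoding
}

# Simulate the EUC_JP bytes the database would pass in.
my $input  = encode('EUC-JP', '私はよだれを垂らす');
my $output = preverse_manual($input);

binmode STDOUT, ':encoding(UTF-8)';
print decode('EUC-JP', $output), "\n";    # すら垂をれだよは私
```

With the patch, the decode and encode steps happen on the PG side, so the function body shrinks back to the plain reverse().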

The Patch:
==========
Applies cleanly to git head as of January 15 2011. PG built with --enable-cassert and --enable-debug seems to run fine with no errors.

I don't think the regression tests cover plperl, so it's understandable that there are no tests in the patch.

There are no manual updates in the patch either, and I think there should be. I think it should be made clear that data (varchar, text, etc., but not bytea) will be passed to perl as UTF-8, regardless of database encoding. Also that "use utf8;" is always loaded and in use.

Code Review:
============
I am not qualified. Looking through the patch, I'm reminded of the old saying: "Any sufficiently advanced perl XS code is indistinguishable from magic" :-)

Other Remarks:
==============
- Yes I know... it was a joke.
- I sure hope this posts to the news group ok
- My terminal (konsole) had a hard time displaying Japanese, so I used psql's \i and \o to read/write files, which kwrite showed/encoded correctly as EUC_JP

Summary:
========
Looks good. Looks needed. Needs manual updates.
