Re: plperlu problem with utf8

From: Alex Hunsaker <badalex(at)gmail(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plperlu problem with utf8
Date: 2010-12-17 02:39:54
Message-ID: AANLkTi=wEQAuw8V+YNB9XyeVHJxtYY0OUat_saB=HjP-@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 8, 2010 at 14:15, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:
> On Wed, 8 Dec 2010, David E. Wheeler wrote:
>
>> On Dec 8, 2010, at 8:13 AM, Oleg Bartunov wrote:
>>
>>> adding utf8::decode($_[0]) solves the problem:
>> Hrm. Ideally all strings passed to PL/Perl functions would be decoded.
>
> yes, this is what I expected.

Erm... no. The in and out from perl AFAICT works fine (minus a caveat
I found, see the end of the mail).

The problem here is that you have the URL-encoded utf8 bytes "%C3%A9".
URI::Escape basically does chr(hex("c3")) and chr(hex("a9")). Perl,
generally, will treat that as two separate unicode code points. So
you end up with two characters (one with a code point of 0xc3, the
other with 0xa9) instead of the one character you expect. If you want
\xc3\xa9 to be treated as a utf8 byte sequence, you need to tell perl
those bytes are utf8 by decoding them. Heck, for all we know, instead
of being a utf8 sequence it could have been a utf16 sequence.
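
To make the two-characters-versus-one point concrete, here is a minimal sketch in Python (not Perl, and not the actual URI::Escape source). The function names are hypothetical; the first one only imitates the chr(hex("XX")) step described above, and the second plays the role that utf8::decode plays in the quoted fix.

```python
def unescape_as_code_points(text):
    """Replace each %XX with the character whose code point is 0xXX."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "%" and i + 2 < len(text):
            # Same idea as chr(hex("c3")): one escape becomes one character.
            out.append(chr(int(text[i + 1:i + 3], 16)))
            i += 3
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def decode_utf8_octets(chars):
    """Treat characters 0..255 as octets and decode them as UTF-8."""
    return chars.encode("latin-1").decode("utf-8")


raw = unescape_as_code_points("%C3%A9")   # two characters: U+00C3, U+00A9
fixed = decode_utf8_octets(raw)           # one character: U+00E9
print(len(raw), len(fixed))
```

Nothing in the escaped text says which encoding the octets were in, so the decode step has to be an explicit choice by the caller.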

You might argue this is a bug in URI::Escape, as I *think* all URIs
will be utf8 encoded. Anyway, I think postgres is doing the right
thing here.

In playing around I did find what I think is a postgres bug. Perl has
2 ways it can store strings internally. Per perldoc perlunicode:

Using Unicode in XS
... What the "UTF8" flag means is that the sequence of octets in the
representation of the scalar is the sequence of UTF-8 encoded code
points of the characters of a string. The "UTF8" flag being off means
that each octet in this representation encodes a single character with
code point 0..255 within the string.

Postgres always prints whatever the internal representation happens to
be, ignoring both the UTF8 flag and the server encoding.

# create or replace function chr(i int, i2 int) returns text as $$
return chr($_[0]).chr($_[1]); $$ language plperlu;
CREATE FUNCTION

# show server_encoding;
server_encoding
-----------------
SQL_ASCII

# SELECT length(chr(128, 33));
length
--------
2

# SELECT length(chr(128, 333));
length
--------
4

Grr, that should error out with "Invalid server encoding", or worst
case it should return a length of 3 (it utf8 encoded 128 into 2 bytes
instead of leaving it as 1). In this case the 333 causes perl to store
the whole string internally as utf8.

Now on a utf8 database:

# show server_encoding;
server_encoding
-----------------
UTF8

# SELECT length(chr(128, 33));
ERROR: invalid byte sequence for encoding "UTF8": 0x80
CONTEXT: PL/Perl function "chr"

# SELECT length(chr(128, 333));
CONTEXT: PL/Perl function "chr"
length
--------
2

Same thing here, we just end up using the internal format. In one
case it works, in the other it does not. The main point being, most of
the time it *happens* to work. But it's really just by chance.
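
The four length() results above can be reproduced with a little byte arithmetic. This is a sketch in Python, under the assumption that the octets below are what perl holds internally for each call (flag off: one octet per character; flag on: utf8). The helper names are hypothetical and only model how a server would count or validate the bytes it is handed.

```python
# Assumed internal octets for chr($_[0]).chr($_[1]) in each call.
flag_off = bytes([0x80, 0x21])             # chr(128).chr(33), UTF8 flag off
flag_on = "\u0080\u014d".encode("utf-8")   # chr(128).chr(333), UTF8 flag on


def length_sql_ascii(octets):
    """SQL_ASCII model: every byte counts as one character."""
    return len(octets)


def is_valid_utf8(octets):
    """UTF8 model: the bytes must pass UTF-8 validation first."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def length_utf8(octets):
    """UTF8 model: count decoded characters (raises on invalid input)."""
    return len(octets.decode("utf-8"))


print(length_sql_ascii(flag_off), length_sql_ascii(flag_on))  # byte counts
print(is_valid_utf8(flag_off), is_valid_utf8(flag_on))
```

A lone 0x80 is a continuation byte with no lead byte, which is why the flag-off string is rejected on the utf8 database while the flag-on string passes.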

I think what we should do is use SvPVutf8() instead of SvPV() in
sv2text_mbverified() when the server encoding is UTF8. SvPV gives us a
pointer to the string in perl's current internal format (maybe one
octet per character, maybe a utf8 byte sequence), while SvPVutf8 will
always give us a utf8 encoded string (which may or may not be valid!).
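
As a rough illustration of the difference, here is a Python model of the two accessors. It is not the perl API: svpv_model and svpvutf8_model are hypothetical names, and the model assumes perl keeps a string in one-octet-per-character form whenever every code point is below 256 (real perl can also hold such a string with the UTF8 flag on).

```python
def internal_octets(text):
    """Model of perl's storage: (octets, utf8_flag) for a string."""
    if all(ord(ch) < 256 for ch in text):
        # UTF8 flag off: each octet is one character with code point 0..255.
        return text.encode("latin-1"), False
    # UTF8 flag on: the octets are the utf8 encoding of the code points.
    return text.encode("utf-8"), True


def svpv_model(text):
    """Like SvPV in this model: whatever the internal octets happen to be."""
    octets, _flag = internal_octets(text)
    return octets


def svpvutf8_model(text):
    """Like SvPVutf8 in this model: always the utf8 encoding."""
    return text.encode("utf-8")


low = chr(128) + chr(33)     # stays in one-octet-per-character form
high = chr(128) + chr(333)   # 333 forces the utf8 form

print(svpv_model(low), svpvutf8_model(low))
print(svpv_model(high), svpvutf8_model(high))
```

In this model the two accessors agree once the string is already in utf8 form and differ only for the flag-off string, where the utf8 variant yields 3 bytes rather than 2.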

Something like the attached. Thoughts? I'm not very happy with the
non-utf8 case: the elog(ERROR, "invalid byte sequence") is a total
cop-out, yes. But I did not see a good solution short of hand-rolling
our own version of sv_utf8_downgrade(). Is it worth it?

Attachment: plperl_encoding.patch (text/x-patch, 1.1 KB)
