Re: plperlu problem with utf8

From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Alex Hunsaker <badalex(at)gmail(dot)com>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plperlu problem with utf8
Date: 2010-12-17 03:24:46
Message-ID: C9982425-2453-479A-88FB-D12B6F20839B@kineticode.com
Lists: pgsql-hackers

On Dec 16, 2010, at 6:39 PM, Alex Hunsaker wrote:

> You might argue this is a bug with URI::Escape as I *think* all uri's
> will be utf8 encoded. Anyway, I think postgres is doing the right
> thing here.

No, URI::Escape is fine. The issue is that if you don't decode text to Perl's internal form, Perl assumes that it's Latin-1.
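To illustrate, here is a minimal sketch in plain Perl (outside PL/Perl). The percent-unescaping is done by hand with a regex so the example has no non-core dependencies; it is only a stand-in for what URI::Escape does, not its implementation, and the sample string is made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "é" percent-encoded as UTF-8 is %C3%A9. Unescape by hand (illustrative
# stand-in for URI::Escape) to get the raw octets.
my $escaped = 'caf%C3%A9';
(my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

# Undecoded: Perl sees 5 characters, each octet read as a Latin-1 code point.
my $undecoded_len = length $bytes;

# Decoded: Perl sees 4 characters, the last one being U+00E9.
my $text        = decode('UTF-8', $bytes);
my $decoded_len = length $text;

printf "undecoded=%d decoded=%d last=U+%04X\n",
    $undecoded_len, $decoded_len, ord(substr($text, -1));
# prints "undecoded=5 decoded=4 last=U+00E9"
```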

> In playing around I did find what I think is a postgres bug. Perl has
> 2 ways it can store things internally. per perldoc perlunicode:
>
> Using Unicode in XS
> ... What the "UTF8" flag means is that the sequence of octets in the
> representation of the scalar is the sequence of UTF-8 encoded code
> points of the characters of a string. The "UTF8" flag being off means
> that each octet in this representation encodes a single character with
> code point 0..255 within the string.
>
> Postgres always prints whatever the internal representation happens to
> be ignoring the UTF8 flag and the server encoding.
>
> # create or replace function chr(i int, i2 int) returns text as $$
> return chr($_[0]).chr($_[1]); $$ language plperlu;
> CREATE FUNCTION
>
> # show server_encoding;
> server_encoding
> -----------------
> SQL_ASCII
>
> # SELECT length(chr(128, 33));
> length
> --------
> 2
>
> # SELECT length(chr(128, 333));
> length
> --------
> 4
>
> Grr that should error out with "Invalid server encoding", or worst
> case should return a length of 3 (it utf8 encoded 128 into 2 bytes
> instead of leaving it as 1). In this case the 333 causes perl to store
> it internally as utf8.

Well, with SQL_ASCII anything goes, no?
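For reference, the lengths 2 and 4 quoted above match the byte length of Perl's internal buffer for those strings. A rough sketch in plain Perl, outside PL/Perl; `internal_bytes` is a hypothetical helper introduced only for this illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: byte length of Perl's internal buffer for a string.
sub internal_bytes {
    my ($s) = @_;
    use bytes;          # lexically scoped: length() counts octets here
    return length $s;
}

my $narrow = chr(128) . chr(33);    # both code points < 256: UTF8 flag stays off
my $wide   = chr(128) . chr(333);   # 333 > 255 forces the UTF8 flag on

printf "narrow: flag=%d bytes=%d\n",
    (utf8::is_utf8($narrow) ? 1 : 0), internal_bytes($narrow);
printf "wide:   flag=%d bytes=%d\n",
    (utf8::is_utf8($wide) ? 1 : 0), internal_bytes($wide);
# prints "narrow: flag=0 bytes=2" and "wide:   flag=1 bytes=4"
```

In the second string, 128 is stored as the two octets C2 80 and 333 as C5 8D, which is where the 4 comes from.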

> Now on a utf8 database:
>
> # show server_encoding;
> server_encoding
> -----------------
> UTF8
>
> # SELECT length(chr(128, 33));
> ERROR: invalid byte sequence for encoding "UTF8": 0x80
> CONTEXT: PL/Perl function "chr"
>
> # SELECT length(chr(128, 333));
> CONTEXT: PL/Perl function "chr"
> length
> --------
> 2
>
> Same thing here, we just end up using the internal format. In one
> case it works, in the other it does not. The main point being, most of
> the time it *happens* to work. But it's really just by chance.
>
> I think what we should do is use SvPVutf8() when we are UTF8 instead
> of SvPV in sv2text_mbverified(). SvPV gives us a pointer to a string
> in perl's current internal format (maybe unicode, maybe a utf8 byte
> sequence), while SvPVutf8 will always give us a utf8 (may or may not be
> valid!) encoded string.
>
> Something like the attached. Thoughts? I'm not very happy with the non
> utf8 case-- The elog(ERROR, "invalid byte sequence") is a total
> cop-out yes. But I did not see a good solution short of hand rolling
> our own version of sv_utf8_downgrade(). Is it worth it?
> <plperl_encoding.patch>

Maybe I'm misunderstanding, but it seems to me that:

* String arguments passed to PL/Perl functions should be decoded from the server encoding to Perl's internal representation before the function actually gets them.

* Values returned from PL/Perl functions that are in Perl's internal representation should be encoded into the server encoding before they're returned.

I didn't really follow all of the above; are you aiming for the same thing?
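The two points above can be sketched in plain Perl as a wrapper that decodes on the way in and encodes on the way out. This is only an illustration of the idea, not how PL/Perl is implemented; `with_encoding_boundary` and `$server_encoding` are hypothetical names, and the encoding is hard-coded here where the real thing would ask the server.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for the server encoding.
my $server_encoding = 'UTF-8';

# Wrap a function so that it receives characters and returns octets.
sub with_encoding_boundary {
    my ($code) = @_;
    return sub {
        # Decode each argument from server octets to Perl characters
        # ("$_" makes a copy so the caller's value is left untouched).
        my @args   = map { decode($server_encoding, "$_") } @_;
        my $result = $code->(@args);
        # Encode the return value back to server octets.
        return encode($server_encoding, $result);
    };
}

my $reverse = with_encoding_boundary(sub { scalar reverse $_[0] });

my $in  = "caf\xC3\xA9";        # "café" as UTF-8 octets
my $out = $reverse->($in);      # "éfac" as UTF-8 octets

print join(' ', map { sprintf '%02X', ord($_) } split //, $out), "\n";
# prints "C3 A9 66 61 63"
```

Without the decode step, reversing the raw octets would give A9 C3 66 61 63, which is not valid UTF-8.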

Best,

David
