Re: plperlu problem with utf8

From: Alex Hunsaker <badalex(at)gmail(dot)com>
To: David Christensen <david(at)endpoint(dot)com>
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plperlu problem with utf8
Date: 2010-12-18 06:43:48
Message-ID: AANLkTinch0U5CE5B8pNsSVu5bh7eOynsjKpugG4sfG92@mail.gmail.com
Lists: pgsql-hackers

On Fri, Dec 17, 2010 at 22:32, David Christensen <david(at)endpoint(dot)com> wrote:
>
> On Dec 17, 2010, at 7:04 PM, David E. Wheeler wrote:
>
>> On Dec 16, 2010, at 8:39 PM, Alex Hunsaker wrote:
>>
>>>> No, URI::Escape is fine. The issue is that if you don't decode text to Perl's internal form, it assumes that it's Latin-1.
>>>
>>> So... you are saying "\xc3\xa9" eq "\xe9" or chr(233) ?
>>
>> Not knowing what those mean, I'm not saying either one, to my knowledge. What I understand, however, is that Perl, given a scalar with bytes in it, will treat it as latin-1 unless the utf8 flag is turned on.
>
> This is a correct assertion as to Perl's behavior.  As far as PostgreSQL is/should be concerned in this case, this is the correct handling for URI::Escape,

Right, so no Postgres bug here. Postgres showing Ã© instead of é is
right as far as it's concerned.
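To make the "\xc3\xa9" versus chr(233) question concrete, here is a minimal sketch using only the core Encode module. The variable names are mine and purely illustrative; it just shows that the undecoded octets are two Latin-1 characters to Perl, and only compare equal to chr(233) once decoded.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two octets: the UTF-8 encoding of U+00E9. Nothing has decoded them yet,
# so Perl sees two characters, U+00C3 and U+00A9 (Latin-1 "Ã" and "©").
my $octets = "\xc3\xa9";

# Decoding turns the two octets into the single character U+00E9.
my $chars = decode('UTF-8', $octets);

printf "octets: length=%d, eq chr(233)? %s\n",
    length($octets), ($octets eq chr(233) ? 'yes' : 'no');
printf "chars:  length=%d, eq chr(233)? %s\n",
    length($chars), ($chars eq chr(233) ? 'yes' : 'no');
```

So the undecoded string is neither "\xe9" nor chr(233) until something decodes it.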

>> PostgreSQL should do everything it can to decode to Perl's internal format before passing arguments, and to decode from Perl's internal format on output.
>
> +1 on the original sentiment, but only for the case that we're dealing with data that is passed in/out as arguments.  In the case that the server_encoding is UTF-8, this is as trivial as a few macros on the underlying SVs for text-like types.  If the server_encoding is SQL_ASCII (= byte soup), this is a trivial case of doing nothing with the conversion regardless of data type.

Right, and that's what we do for the above, minus some mishandling of
non-character datatypes like bytea in the UTF-8 case.

> For any other server_encoding, the data would need to be converted from the server_encoding to UTF-8, presumably using the built-in conversions before passing it off to the first code path.  A similar handling would need to be done for the return values, again datatype-dependent.

Yeah, that's what we *should* do. Right now we just leave it as byte
soup for the user to decode/encode. :(
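For illustration, a rough sketch of what "the user decodes/encodes" looks like today on a non-UTF-8 server. Everything here is an assumption for the example: KOI8-R as the server encoding, and upcase_text as a hypothetical stand-in for a plperl function body. It is plain Perl with the core Encode module, not actual plperl glue.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# ASSUMPTION: the server encoding is KOI8-R. Any non-UTF-8 encoding that
# Encode knows about would be handled the same way.
my $server_encoding = 'koi8-r';

# Hypothetical stand-in for a plperl function body that upper-cases its
# text argument, which arrives as raw octets in the server encoding.
sub upcase_text {
    my ($octets) = @_;
    my $chars = decode($server_encoding, $octets);   # octets -> characters
    return encode($server_encoding, uc $chars);      # characters -> octets
}

# The Russian word "привет" as KOI8-R octets.
my $in  = "\xd0\xd2\xc9\xd7\xc5\xd4";
my $out = upcase_text($in);

# Print the resulting octets in hex.
print join(' ', map { sprintf '%02X', ord } split //, $out), "\n";
```

Calling uc directly on the undecoded octets would leave them unchanged here, since Perl has no idea they are Cyrillic letters.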

> [ correctness of perl character ops in the non utf8 case] One thought I had was that we could expose the server_encoding to the plperl interpreters in a special variable to make it easy to explicitly decode...

We should not need to do anything as complicated as that. We can just
encode the string to utf8 before we hand it off to perl.

[...]
> $ perl -MURI::Escape -e'print length(uri_unescape(q{comment%20passer%20le%20r%C3%A9veillon}))'
> 28
>
> $ perl -MEncode -MURI::Escape -e'print length(decode_utf8(uri_unescape(q{comment%20passer%20le%20r%C3%A9veillon})))'
> 27
[...]
> As shown above, the character length for the example should be 27, while the octet length for the UTF-8 encoded version is 28.  I've reviewed the source of URI::Escape, and can say definitively that: a) regular uri_escape does not handle > 255 code points in the encoding, but there exists a uri_escape_utf8 which will convert the source string to UTF8 first and then escape the encoded value, and

And why should it? Properly escaped URIs should have all those
escaped, I imagine. Anyway, not really relevant for postgres.

> b) uri_unescape has *no* logic in it to automatically decode from UTF8 into perl's internal format (at least as far as the version that I'm looking at, which came with 5.10.1).

>>> Either uri_unescape() should be decoding that utf8() or you need
>>> to do it *after* you call uri_unescape().  Hence the maybe it could be
>>> considered a bug in uri_unescape().
>>
>> Agreed.
>
> -1; if you need to decode from an octets-only encoding, it's your responsibility to do so after you've unescaped it.

-1? That's basically what I said: "... you need to do it (decode the
utf8) *after* you call uri_unescape"

>  Perhaps later versions of the URI::Escape module contain a uri_unescape_utf8() function, but it's trivially: sub uri_unescape_utf8 { Encode::decode_utf8(uri_unescape(shift))}.  This is definitely not a bug in uri_escape, as it is only defined to return octets.

Ahh, so -1 because I said maybe you could call it a bug in
uri_unescape(). Really, I was only saying you *might* be able to
consider it a bug (or perhaps deficiency is a better word) in
uri_unescape iff URIs are defined to have escaped characters as a
%-escaped utf8 sequence. I don't know that they are, so I don't know
if it's a bug :)
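Either way, the "unescape, then decode" order we both describe can be sketched as below. To keep it runnable with core modules only, unescape_octets is a hand-rolled, hypothetical stand-in for URI::Escape's uri_unescape (it only turns %XX into octets), and unescape_utf8 is an illustrative name, not something the module is known to provide.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for URI::Escape::uri_unescape so the sketch needs
# core modules only: turns each %XX into the corresponding octet.
sub unescape_octets {
    my ($str) = @_;
    $str =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $str;
}

# Unescape first, then decode the resulting octets as UTF-8.
sub unescape_utf8 {
    my ($str) = @_;
    return decode('UTF-8', unescape_octets($str));
}

my $uri = 'comment%20passer%20le%20r%C3%A9veillon';
print length(unescape_octets($uri)), "\n";   # 28 (octets)
print length(unescape_utf8($uri)), "\n";     # 27 (characters)
```

That matches the 28 versus 27 lengths quoted above: the only difference is whether the two octets for é have been decoded into one character.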
