Re: Careful PL/Perl Release Not Required

From: Alex Hunsaker <badalex(at)gmail(dot)com>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Careful PL/Perl Release Not Required
Date: 2011-02-11 07:43:54
Message-ID: AANLkTimp9yiGqAGLvwJifb1gvJ6xK0PUZh3td30BEU5C@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 10, 2011 at 21:53, David E. Wheeler <david(at)kineticode(dot)com> wrote:
> On Feb 10, 2011, at 5:28 PM, Alex Hunsaker wrote:

>> The other thing that changed is non UTF-8 databases now also get
>> character semantics. That is we convert from the database encoding
>> into utf8 and vice versa on output. That probably should be noted
>> somewhere...
>
> Oh. I see. And Oleg's database wasn't utf-8 then, I guess. I'll have to re-read the JSON docs, I guess. Erm…feh. Okay. I have to pass the false value to utf8() *now*. Okay, at least that's more consistent.

I'd like to quibble with you over this point if I may. :-)
Per perldoc: JSON::XS
"utf8" flag disabled
When "utf8" is disabled (the default), then
"encode"/"decode" generate and expect Unicode strings ...

So
- If you are on < 9.1 and a utf8 database you want to pass
utf8(false), as you have a Unicode string.

- If you are on < 9.1 and on a non utf8 database you would want to
pass utf8(false) as the string is *not* Unicode, it's byte soup. It's in
some _other_ encoding, say EUC_JP. You would need to decode() it into
Unicode first.

- If you are on 9.1 and a utf8 database you still want to pass
utf8(false) as the string is still Unicode.

- If you are on 9.1 and a non utf8 database you want to pass
utf8(false) as the string is _now_ Unicode.

So... it seems you always want to pass false. The only case I can
think of where you would want to pass true is if you are on < 9.1 with
a SQL_ASCII database and you know for a fact the string represents a
utf8 byte sequence.

Or am I missing something obvious?
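
To make the flag semantics concrete, here is a minimal sketch. It uses
JSON::PP, the core module that mirrors the JSON::XS interface, as a
stand-in (that substitution is an assumption; the perldoc quoted above
is for JSON::XS). The variable names are illustrative only.

```perl
use strict;
use warnings;
use JSON::PP;

# A character (Unicode) string: the JSON array ["é"], with é as one character.
my $chars = "[\"\x{e9}\"]";
# The same JSON text as UTF-8 encoded bytes: é is the two bytes 0xc3 0xa9.
my $bytes = "[\"\xc3\xa9\"]";

# utf8(0): decode() expects characters.
my $from_chars = JSON::PP->new->utf8(0)->decode($chars)->[0];
# utf8(1): decode() expects UTF-8 bytes and decodes them itself.
my $from_bytes = JSON::PP->new->utf8(1)->decode($bytes)->[0];
# utf8(0) fed UTF-8 bytes: each byte is taken as its own character.
my $mismatch = JSON::PP->new->utf8(0)->decode($bytes)->[0];

printf "chars/utf8(0): %d, bytes/utf8(1): %d, bytes/utf8(0): %d\n",
    length $from_chars, length $from_bytes, length $mismatch;
```

The first two results should be the same one-character string. The
third is what you get when the flag does not match the input: two
characters instead of one.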

>> If you do have to change your semantics/functions, could you post an
>> example? I'd like to make sure it's because you were hitting one of
>> those nasty corner cases and not something new is broken.
>
> I think that people who have non-utf-8 databases might be surprised.

Yeah, surprised it does the right thing and it's actually usable now ;).

>>> This probably won't be that common, but Oleg, for example, will need to convert his fixed function from:

> No, he had to add the decode line, IIRC:
>
> CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar  AS $$
>  use strict;
>  use URI::Escape;
>  utf8::decode($_[0]);
>  return uri_unescape($_[0]); $$ LANGUAGE plperlu;
>
> Because uri_unescape() needs its argument to be decoded to Perl's internal form. On 9.1, it will be, so he won't need to call utf8::decode(). That is, in a latin-1 database:

Meh, no, not really. He will still need to call decode. The problem is
that uri_unescape() does not assume an encoding on the URI. It could be
UTF-16 encoded for all it knows (UTF-8 is probably standard, but that's
not the point; it knows nothing about Unicode or encodings).

For example, let's say you have a latin-1 accented e "é": the byte
sequence is the one byte 0xe9. If you were to uri_escape that you get
the 3 byte ascii string "%E9":
$ perl -E 'use URI::Escape; my $str = "\xe9"; say uri_escape($str)'
%E9

If you uri_unescape "%E9" you get 1 byte back with a hex value of 0xe9:
$ perl -E 'use URI::Escape; my $str = uri_unescape("%E9"); say sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length $str)'
chr: é hex: e9, len: 1

What if we want to uri_escape a UTF-16 accented e? That's the two bytes 0x00 0xe9:
$ perl -E 'use URI::Escape; my $str = "\x00\xe9"; say uri_escape($str)'
%00%E9

What happens when we uri_unescape that? Do we get back a Unicode string
that has one character? No. And why should we? How is uri_unescape
supposed to know what %00%E9 represents? All it knows is that's 2
separate bytes:
$ perl -E 'use URI::Escape; my $str = uri_unescape("%00%E9"); say sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length $str)'
chr: é hex: 00e9, len: 2
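
To get one character out of those two bytes, the caller has to supply
the encoding. A minimal sketch, assuming the bytes are big-endian
UTF-16 and using the core Encode module (the variable names are
illustrative):

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes that uri_unescape("%00%E9") hands back.
my $bytes = "\x00\xe9";

# Only the caller knows these bytes are UTF-16 (big-endian assumed here).
my $text = decode('UTF-16BE', $bytes);

printf "bytes: %d, characters: %d, codepoint: U+%04X\n",
    length $bytes, length $text, ord $text;
```

This prints a byte count of 2, a character count of 1 and the
codepoint U+00E9.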

Now, let's say you want to uri_escape a utf8 accented e; that's the two
byte sequence 0xc3 0xa9:
$ perl -E 'use URI::Escape; my $str = "\xc3\xa9"; say uri_escape($str)'
%C3%A9

Ok, what happens when we uri_unescape those?:
$ perl -E 'use URI::Escape; my $str = uri_unescape("%C3%A9"); say sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length $str)'
chr: é hex: c3a9, len: 2

So, plperl will also return 2 characters here.

In the cited case he was passing "%C3%A9" to uri_unescape() and
expecting it to return 1 character. The additional utf8::decode() will
tell perl the string is in utf8 so it will then return 1 char. The
point being, decode is needed and with it, the function will work pre
and post 9.1.
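
As a rough illustration of what utf8::decode() changes, here is a
sketch that runs outside plperl. percent_decode() is a hypothetical
stand-in for uri_unescape() so the snippet needs no CPAN modules, and
the decode is applied to the unescaped bytes purely to show the effect
on length; the function quoted above calls it on the argument instead.

```perl
use strict;
use warnings;

# Hypothetical stand-in for uri_unescape(): turn each %XX into one byte.
sub percent_decode {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

my $str = percent_decode("%C3%A9");
my $len_before = length $str;    # two bytes: 0xc3 0xa9

# Tell perl the bytes are UTF-8; returns false if they are not valid.
my $ok = utf8::decode($str);
my $len_after = length $str;     # one character: U+00E9

printf "before: %d, after: %d, codepoint: U+%04X\n",
    $len_before, $len_after, ord $str;
```

This prints "before: 2, after: 1, codepoint: U+00E9".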

In fact, on a latin-1 database it sure as heck better return two
characters; it would be a bug if it only returned 1, as that would mean
it would be treating a series of latin1 bytes as a series of utf8
bytes!
