Re: plperlu problem with utf8

From: Alex Hunsaker <badalex(at)gmail(dot)com>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>, David Christensen <david(at)endpoint(dot)com>
Subject: Re: plperlu problem with utf8
Date: 2010-12-18 06:53:36
Message-ID: AANLkTinJTePYyVgFwj37dGL_b0xH8RBB4V3-cKzCemjN@mail.gmail.com
Lists: pgsql-hackers

On Fri, Dec 17, 2010 at 18:04, David E. Wheeler <david(at)kineticode(dot)com> wrote:
> On Dec 16, 2010, at 8:39 PM, Alex Hunsaker wrote:

> Yeah. So I just wrote and tested this function on 9.0 with Perl 5.12.2:
>
>    CREATE OR REPLACE FUNCTION perlgets(
>        TEXT
>    ) RETURNS TABLE(length INT, is_utf8 BOOL) LANGUAGE plperl AS $$
>       my $text = shift;
>       return_next {
>           length  => length $text,
>           is_utf8 => utf8::is_utf8($text) ? 1 : 0
>       };
>    $$;
>
> In a utf-8 database:
>
>    utf8=# select * from perlgets('foo');
>     length │ is_utf8
>    ────────┼─────────
>          8 │ t
>    (1 row)
>
>
> In a latin-1 database:
>
>    latin=# select * from perlgets('foo');
>     length │ is_utf8
>    ────────┼─────────
>          8 │ f
>    (1 row)
>
> I would argue that in the latter case, is_utf8 should be true, too. That is, PL/Perl should decode from Latin-1 to Perl's internal form.

Just to reiterate in a different way what David C. said, the flag is
irrelevant in this case. The flag being set on that input string is the
same as it not being set.

Per perldoc perlunicode:
The "UTF8" flag being on does not mean that there are any characters
of code points greater than 255 (or 127) in the scalar or that there
are even any characters in the scalar. What the "UTF8" flag means is
that the sequence of octets in the representation of the scalar is the
sequence of UTF-8 encoded code points of the characters of a string.
The "UTF8" flag being off means that each octet in this representation
encodes a single character with code point 0..255 within the string.

Basically, Perl has *2* internal forms, and certain strings can be
represented in both.
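
A minimal sketch of that in plain Perl, run outside PL/Perl (so it only
illustrates the flag; it says nothing about what the backend does with its
arguments). The sample string "caf\x{e9}" is my own choice: it contains
one character above 127 but below 256, so it fits in either form.

```perl
use strict;
use warnings;

# Hypothetical sample string: "caf" plus e-acute (code point 0xE9).
# Built this way, Perl stores it with the UTF8 flag off,
# one octet per character.
my $native = "caf\x{e9}";

# Copy it and switch the copy to the other internal form.
# utf8::upgrade changes the representation, not the characters.
my $upgraded = $native;
utf8::upgrade($upgraded);

printf "native:   is_utf8=%d length=%d\n",
    utf8::is_utf8($native)   ? 1 : 0, length $native;
printf "upgraded: is_utf8=%d length=%d\n",
    utf8::is_utf8($upgraded) ? 1 : 0, length $upgraded;

# Same characters, so the two compare equal despite the differing flag.
print "equal\n" if $native eq $upgraded;
```

Both copies report a length of 4 and compare equal; only is_utf8 differs
(0 for the first, 1 for the second).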

> Interestingly, when I created a function that takes a bytea argument, utf8 was *still* enabled in the utf-8 database. That doesn't seem right to me.

Hrm, yeah, that seems bogus. I'll have to play with that more.
