Re: regexp_replace and UTF8

From: Jasen Betts <jasen(at)xnet(dot)co(dot)nz>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: regexp_replace and UTF8
Date: 2009-02-04 09:59:08
Message-ID: gmbotc$2je$1@reversiblemaps.ath.cx
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

On 2009-02-02, Bart Degryse <Bart(dot)Degryse(at)indicator(dot)be> wrote:
>
> --=__PartF6DE34E1.0__=
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: quoted-printable
>
> Thanks for the ideas!
> The function Jasen suggests works partially.
> There are also entities like
> &#8216;
> &#8217;
> &#8220;
> &#8221;
> &#8230;

these work fine here.

jasen=# select htmlent('&#8216; &#8217; &#8220; &#8221; &#8230;');
htmlent
-----------
‘ ’ “ ” …
(1 row)

> The output of the htmlent function for these looks like |

I see a pipe symbol '|' is that what you wrote?

> Was that what you meant with
>>"characters outside of the LATIN-1 space
>> are not handled but apparently this is what you want." ?

I was under a mistaken impression of the encoding of HTML numeric
entities.

> Because in that case, they should be handled too.
> How should that be done.
> Thanks,

it works here because server_encoding is UTF8;

to check do this.

show server_encoding;show client_encoding;

it looks like you's need to convert the numbers to utt-8 in a bytea
type and then use convert to translate then to your preferred encoding
(appears to be win125x)

postgresql8.3 appears to provide no way to generate UTF-8 in a bytea
type so if you can't use a utf-8 encoding for your database you'll
have to write your own.

>>>> Jasen Betts <jasen(at)xnet(dot)co(dot)nz> 2009-01-31 12:47 >>>
> On 2009-01-30, Bart Degryse <Bart(dot)Degryse(at)indicator(dot)be> wrote:
>>
>> --=3D__Part8EA648F8.0__=3D
>> Content-Type: text/plain; charset=3DUTF-8
>> Content-Transfer-Encoding: quoted-printable
>>
>> Hi,
>> I have a text field with data like this: 'de pati&#235;nt niet'
>> (without the quotes).
>> I would like to convert this string to look like this: 'de pati=C3=ABnt
>> niet'
>> Basically what I need to do (I think) is
>> - get rid of the &, # and ;
>> - convert the number to hex
>> - make a UTF8 from that (thus: \xEB)
> that is not UTF8.
>
> the UTF8 representation for '=C3=AB' in SQL is e'\xC3\xAB' or chr(235)
>
> your input appears to be encoded in LATIN-1 (or possibly 8859-13)
>
>
> I think you'll need to write a function.
>
> here, where I have database encoding UTF8 this appears to work as you
> desire.
>
> CREATE or replace FUNCTION htmlent(inp text) returns text as
> $f$
> DECLARE
> str text;
> BEGIN
> str=3D regexp_replace(quote_literal( inp)
> ,$$&#(\d+);$$
> ,$$'||chr(\1)||'$$=20
> ,'g');
> execute 'select '||str into str;
> return str;
> END
> $f$
> LANGUAGE PLPGSQL;
>
> select htmlent('de pati&#235;nt niet');
>
> probably the above should be expanded to handle named entities=20
> like '&amp;' too.
>
> characters outside of the LATIN-1 space are not handled=20
> but aparently this is what you want.
>
>
> --=20
> Sent via pgsql-sql mailing list (pgsql-sql(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql
>
> --=__PartF6DE34E1.0__=
> Content-Type: text/html; charset=UTF-8
> Content-Transfer-Encoding: quoted-printable
>
><HTML><HEAD>
><META http-equiv=3DContent-Type content=3D"text/html; charset=3Dutf-8">
><META content=3D"MSHTML 6.00.2800.1264" name=3DGENERATOR></HEAD>
><BODY style=3D"MARGIN: 4px 4px 1px; FONT: 10pt Tahoma">
><DIV>Thanks for the ideas!</DIV>
><DIV>The function Jasen suggests works partially.</DIV>
><DIV>There are also entities like</DIV>
><DIV>&amp;#8216;</DIV>
><DIV>
><DIV>&amp;#8217;</DIV>
><DIV>&amp;#8220;</DIV>
><DIV>&amp;#8221;</DIV>
><DIV>&amp;#8230;</DIV>
><DIV>The output of the htmlent function&nbsp;for these looks like | </DIV>
><DIV>Was that what you meant with "characters outside of the LATIN-1 space =
> are not handled</DIV>
><DIV>but apparently this is what you want."&nbsp;?</DIV>
><DIV>Because in that case, they should be handled too.</DIV>
><DIV>How should that be done.</DIV>
><DIV>Thanks,</DIV>
><DIV>Bart&nbsp;</DIV><BR><BR>&gt;&gt;&gt; Jasen Betts &lt;jasen(at)xnet(dot)co(dot)nz&=
> gt; 2009-01-31 12:47 &gt;&gt;&gt;<BR>On 2009-01-30, Bart Degryse &lt;Bart.D=
> egryse(at)indicator(dot)be&gt; wrote:<BR>&gt;<BR>&gt; --=3D__Part8EA648F8.0__=3D<B=
> R>&gt; Content-Type: text/plain; charset=3DUTF-8<BR>&gt; Content-Transfer-E=
> ncoding: quoted-printable<BR>&gt;<BR>&gt; Hi,<BR>&gt; I have a text field =
> with data like this: 'de pati&amp;#235;nt niet'<BR>&gt; (without the =
> quotes).<BR>&gt; I would like to convert this string to look like this: =
> 'de pati=C3=ABnt<BR>&gt; niet'<BR>&gt; Basically what I need to do (I =
> think) is<BR>&gt; - get rid of the &amp;, # and ;<BR>&gt; - convert the =
> number to hex<BR>&gt; - make a UTF8 from that (thus: \xEB)<BR>that is not =
> UTF8.<BR><BR>the UTF8 representation for '=C3=AB' in&nbsp; SQL is =
> e'\xC3\xAB' or chr(235)<BR><BR>your input appears to be encoded in LATIN-1 =
> (or possibly 8859-13)<BR><BR><BR>I think you'll need to write a function.<B=
> R><BR>here, where I have database encoding UTF8 this appears to work as =
> you<BR>desire.<BR><BR>CREATE or replace FUNCTION htmlent(inp text) returns =
> text as<BR>$f$<BR>DECLARE<BR>&nbsp; str text;<BR>BEGIN<BR>str=3D regexp_rep=
> lace(quote_literal( inp)<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp=
> ;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; =
> ,$$&amp;#(\d+);$$<BR>&nbsp;&nbsp;&nbsp; ,$$'||chr(\1)||'$$ <BR>&nbsp;&nbsp;=
> &nbsp; ,'g');<BR>execute 'select '||str into str;<BR>return str;<BR>END<BR>=
> $f$<BR>LANGUAGE PLPGSQL;<BR><BR>select htmlent('de pati&amp;#235;nt =
> niet');<BR><BR>probably the above should be expanded to handle named =
> entities <BR>like '&amp;amp;' too.<BR><BR>characters outside of the =
> LATIN-1 space are not handled <BR>but aparently this is what you want.<BR><=
> BR><BR>-- <BR>Sent via pgsql-sql mailing list (pgsql-sql(at)postgresql(dot)org)<BR=
>>To make changes to your subscription:<BR><A href=3D"http://www.postgresql.=
> org/mailpref/pgsql-sql">http://www.postgresql.org/mailpref/pgsql-sql</A><BR=
>></DIV></BODY></HTML>
>
> --=__PartF6DE34E1.0__=--
>

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message F. 2009-02-04 21:57:58 postgre2postgre
Previous Message Nikhil teltia 2009-02-03 00:06:08 Function Returning a Set of Composite Value