Re: XPATH vs. server_encoding != UTF-8

From: Florian Pflug <fgp(at)phlo(dot)org>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XPATH vs. server_encoding != UTF-8
Date: 2011-07-24 17:51:27
Message-ID: 799546AE-B11C-4718-BA19-A182E796F6C4@phlo.org

On Jul24, 2011, at 01:25 , Florian Pflug wrote:
> On Jul23, 2011, at 22:49 , Peter Eisentraut wrote:
>
>> On lör, 2011-07-23 at 17:49 +0200, Florian Pflug wrote:
>>> The current thread about JSON and the ensuing discussion about the
>>> XML types' behaviour in non-UTF8 databases made me try out how well
>>> XPATH() copes with that situation. The code, at least, looks
>>> suspicious - XPATH neither verifies that the server encoding is UTF-8,
>>> nor does it pass the server encoding on to libxml's xpath functions.
>>
>> This issue is on the Todo list, and there are some archive links there.
>
> Thanks for the pointer, but I think the discussion there doesn't
> really apply here.

Upon further reflection, I came to realize that it in fact does apply.

All the non-XPath related XML *parsing* seems to go through xml_parse(),
but we also use libxml to write XML, making XMLELEMENT() and friends
equally susceptible to all kinds of encoding trouble. For the fun of it,
try the following in an ISO-8859-1 database (with client_encoding correctly
set up, so the umlaut-a reaches the backend unharmed):

select xmlelement(name "r", xmlattributes('ä' as a));

you get

xmlelement
-------------------
<r a="&#x4000;"/>

Well, actually, you only get that about 9 times out of 10. Sometimes
you instead get

xmlelement
---------------------------
<r a="&#x4001;\x01\x01"/>

It seems that libxml reads past the terminating zero byte if it's
preceded by an invalid UTF-8 byte sequence (like 0xe4 0x00 in the example
above). Ouch!
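
To see how 0xe4 0x00 can end up as &#x4000; or &#x4001;, here is a minimal
Python sketch. It is an illustration only and not libxml's actual code: it
assumes a decoder that trusts the UTF-8 lead byte and never checks the
continuation bytes. The function name and the simulated memory contents are
made up for the example.

```python
def naive_utf8_decode_one(buf, pos=0):
    """Decode one code point, trusting the lead byte and never validating
    continuation bytes. Returns (code_point, bytes_consumed)."""
    lead = buf[pos]
    if lead < 0x80:
        return lead, 1
    if lead >> 5 == 0b110:
        length, cp = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:
        length, cp = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("stray continuation byte")
    for b in buf[pos + 1 : pos + length]:
        # No check that b has the form 0b10xxxxxx, so a zero terminator
        # is swallowed like any other byte.
        cp = (cp << 6) | (b & 0x3F)
    return cp, length


# umlaut-a in ISO-8859-1 is the single byte 0xe4, which UTF-8 treats as
# the lead byte of a three-byte sequence.
latin1 = "ä".encode("iso-8859-1")

# Simulated memory: the string, its zero terminator, then whatever
# happens to follow it (assumed values, for illustration).
memory_a = latin1 + b"\x00" + b"\x00\x00"
memory_b = latin1 + b"\x00" + b"\x01\x01\x01"

print(hex(naive_utf8_decode_one(memory_a)[0]))
print(hex(naive_utf8_decode_one(memory_b)[0]))
```

In both cases three bytes are consumed, so the decoder has already stepped
over the terminator at offset 1, and the resulting code point depends on the
byte that happens to follow it.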

Also, passing encoding ASCII to libxml's parser doesn't prevent it from
expanding entity references referring to characters outside the ASCII
range. So even with my patch applied you can make XPATH() return wrong
results. For example (0xe4 is the Unicode code point representing umlaut-a)

select xpath('/r/@a', '<r a="&#xe4;"/>'::xml);

gives (*with* my patch applied)

xpath
-------
{ä}
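
The same effect can be reproduced outside of PostgreSQL. The following is a
hedged Python sketch that uses the standard library's expat-based parser
rather than libxml, so it only illustrates the general XML behaviour: a
document consisting purely of ASCII bytes, and declared as US-ASCII, still
yields a non-ASCII character once the character reference is expanded. The
helper fits_ascii() is made up for the example.

```python
import xml.etree.ElementTree as ET

# Every byte of this document is ASCII, and it says so in its declaration.
doc = b'<?xml version="1.0" encoding="US-ASCII"?><r a="&#xe4;"/>'

root = ET.fromstring(doc)
value = root.get("a")  # the parser has expanded &#xe4; at this point


def fits_ascii(s):
    """Return True if s can be represented in ASCII (hypothetical helper)."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(hex(ord(value)), fits_ascii(value))
```

So restricting the parser's input encoding says nothing about which
characters can show up in the parsed result.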

So scratch the whole idea. There doesn't seem to be a simple way to
make the XML type work sanely in a non-UTF-8 setting :-(. Apart from
simple input and output that is, which already seems to work correctly
regardless of the server encoding.

BTW, for the sake of getting this into the archives just in case someone
decides to fix this and stumbles over this thread:

It seems to me that the easiest way to fix XML generation in the non-UTF-8
case would be to cease using libxml for emitting XML at all. The only
non-trivial use of libxml there is the escaping of attribute values, and
we do already have our own escape_xml() function - it just needs to be
taught the additional escapes needed for attribute values. (libxml is
also used to convert binary values to base64 or hexadecimal notation,
but there're no encoding issues there)
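
As a rough sketch of what the additional escapes could look like, here is a
small Python illustration. It is not PostgreSQL's escape_xml() (which is C),
and the name escape_xml_attribute() as well as the exact set of escaped
characters are assumptions: besides &, < and >, it escapes the double quote
and the whitespace characters that an XML parser would otherwise normalize
inside attribute values.

```python
# Characters replaced when emitting a double-quoted attribute value.
# All of them are ASCII, so in any ASCII-compatible encoding the other
# characters can pass through untouched.
_ATTR_ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "\t": "&#x9;",
    "\n": "&#xA;",
    "\r": "&#xD;",
}


def escape_xml_attribute(value):
    """Escape value for use inside a double-quoted XML attribute
    (hypothetical helper, for illustration only)."""
    return "".join(_ATTR_ESCAPES.get(ch, ch) for ch in value)


print('<r a="%s"/>' % escape_xml_attribute('ä "quoted" & <tagged>'))
```

Since only ASCII characters are ever replaced, the umlaut-a is copied
through as-is and no conversion to or from UTF-8 is needed.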

best regards,
Florian Pflug
