XPATH vs. server_encoding != UTF-8

From: Florian Pflug <fgp(at)phlo(dot)org>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: XPATH vs. server_encoding != UTF-8
Date: 2011-07-23 15:49:37
Message-ID: AA288A66-0451-4AC4-9DBD-1AFC7892F25D@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi

The current thread about JSON and the ensuing discussion about the
XML types' behaviour in non-UTF8 databases made me try out how well
XPATH() copes with that situation. The code, at least, looks
suspicious - XPATH neither verifies that the server encoding is UTF-8,
not does it pass the server encoding on to libxml's xpath functions.

So I created a database with encoding ISO-8859-1 (LATIN1), and did
(which aclient encoding matching my terminal's settings)

CREATE TABLE X (d XML);
INSERT INTO X VALUES ('<r a="ä"/>');

i.e, I inserted the XML document <r a="&auml;"/>, but without using
an entity reference for the german Umlaut-A. Then I attempted to extract
the length of r's attribute "a" with the XPATH /r/@a, both with the XPath
function string-length (which works now! yay!) and with postgres'
LENGTH() function.

SELECT
(XPATH('string-length(/r/@a)', d))[1] AS xpath_length,
LENGTH((XPATH('/r/@a', d))[1]::text) AS pg_length
FROM X;

The XPATH() function itself doesn't complain, but libxml does - it expects
UTF-8 encoded data, and screams bloody murder when it encounters the
ISO-8859-1-encoded Umlaut-A

ERROR: could not parse XML document
DETAIL: line 1: Input is not proper UTF-8, indicate encoding !
Bytes: 0xE4 0x22 0x2F 0x3E
<r a="ä"/>

That might seem fine on the surface - we did, after all, error out instead
of producing potentially non-sensical results. However, libxml's ability to
detect this error relies on it's ability to distinguish between UTF-8 and
non-UTF-8 encoded strings. Which, of course, doesn't work in the general case.

So for my next try, I deliberately set client_encoding to ISO-8859-1, even
though my terminal uses UTF-8, removed all data from table X, and did

INSERT INTO X VALUES ('<r a="ä"/>');

again. The effect is that is that X now contains ISO-8859-1 encoded data
which *happens* to look like valid UTF-8. After changing the client_encoding
back to UTF-8, the value we just inserted looks like that

<r a="ä"/>

Now I invoked the XPATH query from above again.

SELECT
(XPATH('string-length(/r/@a)', d))[1] AS xpath_length,
LENGTH((XPATH('/r/@a', d))[1]::text) AS pg_length
FROM X;

As predicted, it doesn't raise an error this time, since libxml is unable
to distinguish the ISO-8859-1 string '<r a="ä"/' from valid UTF-8. But the
result is still wrongs, since the string-length() function counts 'ä' as just
one character, when it reality it are of course contains two.

xpath_length | pg_length
--------------+-----------
1 | 2

The easiest way to fix this would be to make XPATH() flat-out refuse to
do anything if the server encoding isn't UTF-8. But that seems a bit harsh -
things actually do work correctly as long as the XML document contains only
ASCII characters, and existing applications might depend on that.

So what I think we should do is tell libxml that the encoding is ASCII
if the server encoding isn't UTF-8. With that change, the query above
produces

ERROR: could not parse XML document
DETAIL: encoder error

which seems sane. Replacing the data in X with ASCII-only data makes the
error go away, and the result is then correct also.

DELETE FROM X;
INSERT INTO X VALUES ('<r a="a"/>');
SELECT
(XPATH('string-length(/r/@a)', d))[1] AS xpath_length,
LENGTH((XPATH('/r/@a', d))[1]::text) AS pg_length
FROM X;

gives

xpath_length | pg_length
--------------+-----------
1 | 1

Proof-of-concept patch attached, but doesn't yet include documentation
updates.

Comments? Thoughts? Suggestions?

best regards,
Florian Pflug

Attachment Content-Type Size
xpath_nonutf8.patch application/octet-stream 1.7 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Florian Pflug 2011-07-23 16:46:32 Re: XPATH vs. server_encoding != UTF-8
Previous Message Tom Lane 2011-07-23 15:08:52 Re: [GENERAL] Dropping extensions