BUG #8469: Xpath behaviour unintuitive / arguably wrong

From: dennis(dot)noordsij(at)helsinki(dot)fi
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #8469: Xpath behaviour unintuitive / arguably wrong
Date: 2013-09-24 18:43:19
Message-ID: E1VOXZv-0008Q9-0Z@wrigleys.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 8469
Logged by: Dennis
Email address: dennis(dot)noordsij(at)helsinki(dot)fi
PostgreSQL version: 9.3.0
Operating system: FreeBSD 9.2-RC4
Description:

Hi,

After upgrading an 8.1 version to 9.3.0 I am suddenly seeing text fields
containing "&" where they are populated from XML. This may be a
coincidence and the problem may have existed earlier, in any case, now I
noticed.

I extract the text content of XML nodes using xpath, from something like:

<name>Jones &amp; Smith</name>

The reason I end up with "&amp;" is the IMHO rather odd xpath behaviour:

# select xpath('/a/text()', (select xmlelement(name "a", 'A & B')));

xpath
---------------
{"A &amp; B"}

The canonical contents of "a" is "A & B". At first search I've found some
rather heated debates about this with bits of name calling; I certainly do
not want to get into that and I apologize in advance to those who feel very
strongly about this.

I've seen one "fix" describe the problem as:

""DESCRIPTION: Submitter invokes following statement:
SELECT (XPATH('/*/text()', '<root>&lt;</root>'))[1].
He expect (escaped) result "&lt;", but gets "<"
"""

With respect, this "bug" makes no sense as this produces in fact the right
result. The actual value of <root> is "<", it's just escaped when serialized
to XML. If <root> were to actually contain "&lt;", it'd be serialized as
"&amp;lt;". It should not be possible to be blindly cast to a text type, but
explicitly serialized as such.

At least the reviewer at:

http://www.postgresql.org/message-id/201106291934.23089.rsmogura@softperience.eu

agrees, but I don't know what happened with that.

The python lxml implementation based on libxml2 seems to also agree:

>>> from lxml import etree
>>> a = etree.XML("<a/>")
>>> a.text = "A & B";
>>> a
<Element a at 8019eb470>
>>> etree.tostring(a);
'<a>A &amp; B</a>'
>>> a.text
'A & B'
>>> a.xpath('/a')
[<Element a at 8019eb470>]
>>> a.xpath('/a/text()')
['A & B']

and similarly for a simple test using xsltproc when set to output text.

If this really is the intended behaviour or something which can or will not
be changed, then it invites double (un)escaping bugs and so on, and I would
like to ask how you are supposed to sanely extract the intended text from a
node in an XML document without risking double (un)escaping, and whether
everybody else is doing it wrong?

I get that xpath(..text()) apparently wants to return a type XML, that this
is on purpose and that there are certain use cases where you want to treat
the result as a type XML which you could not do if it returned an unescaped
text value, like here:

select xmlelement(name "b", (select (select xpath('/a/text()', (select
xmlelement(name "a", 'A & B'))))[1]));
xmlelement
------------------
<b>A &amp; B</b>

which does not double escape the contents, but where if you cast, it does:

select xmlelement(name "b", (select (select xpath('/a/text()', (select
xmlelement(name "a", 'A & B'))))[1]::text));
xmlelement
----------------------
<b>A &amp;amp; B</b>
(1 row)

I personally don't believe this is very helpful. The escaping is only a
serialization artifact, a text node does not actually contain any &amp;s and
so on.

My first thought is then that casting between text and xml should not even
be possible, and always an explicit (de)serialization to/from text using a
chosen encoding (with a shortcut to a PostgreSQL unicode text type), i.e.
treated similarly to the difference between a unicode string and utf-8
encoded representation , and not the equivalent of blindly casting a byte
sequence to a string and back and hoping for the best.

If xpath(..text()) then absolutely has to return a type XML I would be happy
to explicitly serialize it to a type text, if PostgreSQL would forbid me
from (accidentally) storing a result in my text field I almost certainly did
not intend (the escaped value containing &amp;). Of course my first
preference would be that it would return a type TEXT.

I appreciate any thoughts and workarounds. I don't really want to add xml
unescapes everywhere, that feels like that php method of unescaping a string
until it stops changing. If the user did intend the literal text "&amp;" I
of course want to preserve that.

Many thanks!

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Volberg, Ovsei 2013-09-24 21:14:59 Errors during PostgreSQL 9.2 and 9.3 installation
Previous Message Magnus Hagander 2013-09-24 17:31:21 Re: BUG #8467: Slightly confusing pgcrypto example in docs