XMLATTRIBUTES vs. values of type XML

From: Florian Pflug <fgp(at)phlo(dot)org>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: XMLATTRIBUTES vs. values of type XML
Date: 2011-07-26 20:44:56
Message-ID: C71079E6-12D8-4048-B8C5-18368936FD5D@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi

While reviewing the (now applied) XPATH escaping patches, Radoslaw found one
case where the previous failure of XPATH to escape its return value was offset
by XMLATTRIBUTES insistence to escape all input values, even if they're
already of type XML.

To wit, if you do

SELECT XMLELEMENT(NAME "t", XMLATTRIBUTES('&amp;'::XML AS "a"))

you get

xmlelement
--------------------
<t a="&amp;amp;"/>

which is somewhat surprising. Especially since

SELECT XMLELEMENT(NAME "t", '&amp;'::XML);

gives

xmlelement
--------------
<t>&amp;</t>

as one would except. Now, it seems rather unlikely that a real application
would actually contain a query similar to the former one - you usually don 't
store XML in the attributes of XML nodes. But since XPATH() returns XML, it's
not unlikely that an application would do

SELECT XMLELEMENT(NAME ..., XMLATTRIBUTES((XPATH(...))[1] AS ...))

As it stands, on 9.2 the values returned by the XPath expression will be
escaped twice - once by XPATH and a second time by XMLATTRIBUTES, while on 9.1
the value will be escaped only once.

Since this seems to be the most likely situation in which XMLATTRIBUTES would
receive an input argument of type XML, and since it's not entirely logical for
XMLATTRIBUTES to unconditionally escapes values already of type XML, I propose
to change the behaviour of XMLATTRIBUTES as follows.

Values not of type XML are be treated as they always have been, i.e. all
special characters (<,>,&,",') are replaced by an entity reference (&lt;,
&gt;, &amp;, &quot;, &apos;).

Values of type XML are assumed to be already escaped, so '&' isn't treated as
a special character to avoid double escaping. The other special characters
(<,>,",') are escaped as usual.

The safety of this relies on the fact that '&' may never appear in a
well-formed XML document, except as part of an entity reference. This seems to
be the case, except if CDATA sections are involved, which may contain plain
ampersands. So the actual escaping rule would have to be a bit more complex
than I made it sound above - we'd need to detect CDATA sections, and
re-enabled the escaping of '&' until the end of such a section.

To actually implement this, I'd remove the use of libxml from the
implementation of XMLELEMENT, and instead use our already existing
escape_xml() function, enhanced with the ability to handle the partial
escaping algorithm outlined above. We already only use libxml to escape
attribute values, so doing that isn't a radical departure from xml.c's ways.
As an added benefit, all the encoding-related problems of XMLELEMENT and
XMLATTRIBUTES would go away once libxml is removed from this code path. So
XPATH() would be the only remaining function which breaks in non-UTF-8
databases.

Comments? Thoughts? Suggestions?

best regards,
Florian Pflug

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2011-07-26 21:04:28 Re: isolation test deadlocking on buildfarm member coypu
Previous Message Noah Misch 2011-07-26 20:38:08 Re: sinval synchronization considered harmful