Re: XML with invalid chars

From: Noah Misch <noah(at)leadboat(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XML with invalid chars
Date: 2011-04-27 21:30:47
Message-ID: 20110427213047.GA8223@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Apr 27, 2011 at 03:05:30PM -0400, Andrew Dunstan wrote:
> On 04/26/2011 05:11 PM, Noah Misch wrote:
>> On Mon, Apr 25, 2011 at 07:25:02PM -0400, Andrew Dunstan wrote:
>>> I came across this today, while helping a customer. The following will
>>> happily create a piece of XML with an embedded ^A:
>>>
>>> select xmlelement(name foo, null, E'abc\x01def');
>>>
>>> Now, a ^A is totally forbidden in XML version 1.0, and allowed but only
>>> as "&#x01;" or equivalent in XML version 1.1, and not as a 0x01 byte
>>> (see<http://en.wikipedia.org/wiki/XML#Valid_characters>)
>>>
>>> ISTM this is something we should definitely try to fix ASAP, even if we
>>> probably can't backpatch the fix.
>> +1. Given that such a datum breaks dump+reload, it seems risky to do nothing at
>> all in the back branches.
>
> Here's a patch along the lines suggested by Peter.
>
> I'm not sure what to do about the back branches and cases where data is
> already in databases. This is fairly ugly. Suggestions welcome.

We could provide a script in (or linked from) the release notes for testing the
data in all your xml columns.

To make things worse, the dump/reload problems seems to depend on your version
of libxml2, or something. With git master, a CentOS 5 system with
2.6.26-2.1.2.8.el5_5.1 accepts the ^A byte, but an Ubuntu 8.04 LTS system with
2.6.31.dfsg-2ubuntu rejects it. Even with a patch like this, systems with a
lenient libxml2 will be liable to store XML data that won't restore on a system
with a strict libxml2. Perhaps we should emit a build-time warning if the local
libxml2 is lenient?

> *** a/src/backend/utils/adt/xml.c
> --- b/src/backend/utils/adt/xml.c
> ***************
> *** 1844,1850 **** escape_xml(const char *str)
> --- 1844,1865 ----
> case '\r':
> appendStringInfoString(&buf, "&#x0d;");
> break;
> + case '\n':
> + case '\t':
> + appendStringInfoCharMacro(&buf, *p);
> + break;
> default:
> + /*
> + * Any control char we haven't already explicitly handled
> + * (i.e. TAB, NL and CR)is an error.
> + * If we ever support XML 1.1 they will be allowed,
> + * but will have to be escaped.
> + */
> + if (*p < '\x20')

This needs to be an unsigned comparison. On my system, "char" is signed, so
"SELECT xmlelement(name foo, null, E'\u0550')" fails incorrectly.

The XML character set forbids more than just control characters; see
http://www.w3.org/TR/xml/#charsets. We also ought to reject, for example,
"SELECT xmlelement(name foo, null, E'\ufffe')".

Injecting the check here aids "xmlelement" and "xmlforest" , but "xmlcomment"
and "xmlpi" still let the invalid byte through. You can also still inject the
byte into an attribute value via "xmlelement". I wonder if it wouldn't make
more sense to just pass any XML that we generate from scratch through libxml2.
There are a lot of holes to plug, otherwise.

> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
> + errmsg("character out of range"),
> + errdetail("XML does not support control characters.")));
> appendStringInfoCharMacro(&buf, *p);
> break;
> }

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2011-04-27 21:57:28 Re: make world fails
Previous Message Cédric Villemain 2011-04-27 21:14:06 Re: [HACKERS] PostgreSQL Core Team